├── .gitignore ├── LICENSE.txt ├── README.md ├── data ├── download_edgebox_proposals.sh ├── metadata │ └── .keep ├── split │ ├── referit_all_imlist.txt │ ├── referit_test_imlist.txt │ ├── referit_train_imlist.txt │ ├── referit_trainval_imlist.txt │ └── referit_val_imlist.txt ├── training │ └── .keep └── vocabulary.txt ├── datasets ├── ReferIt │ ├── ImageCLEF │ │ └── .keep │ └── ReferitData │ │ └── .keep ├── download_kitchen_dataset.sh └── download_referit_dataset.sh ├── demo ├── demo_data │ ├── 40429.jpg │ └── 40429.txt └── retrieval_demo.ipynb ├── exp-kitchen ├── cache_kitchen_training_batches.py ├── caffemodel │ └── .keep ├── test_scrc_on_kitchen.py └── train_scrc_kitchen.sh ├── exp-referit ├── cache_referit_context_features.py ├── cache_referit_training_batches.py ├── caffemodel │ └── .keep ├── initialize_weights_scrc_full.py ├── initialize_weights_scrc_no_context.py ├── preprocess_dataset.py ├── test_scrc_on_referit.py ├── train_scrc_full_on_referit.sh └── train_scrc_no_context_on_referit.sh ├── external └── download_caffe.sh ├── models └── download_trained_models.sh ├── prototxt ├── VGG_ILSVRC_16_layers_deploy.prototxt ├── coco_pretrained.prototxt ├── scrc_full_vgg_buffer_50.prototxt ├── scrc_full_vgg_solver.prototxt ├── scrc_kitchen_buffer_50.prototxt ├── scrc_kitchen_solver.prototxt ├── scrc_no_context_vgg_buffer_50.prototxt ├── scrc_no_context_vgg_solver.prototxt ├── scrc_word_to_preds_full.prototxt ├── scrc_word_to_preds_no_context.prototxt └── scrc_word_to_preds_no_spatial_no_context.prototxt ├── retriever.py └── util ├── __init__.py └── io.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.mat 2 | *.h5 3 | *.pyc 4 | *-checkpoint.ipynb 5 | *~ 6 | *.swp 7 | 8 | *.zip 9 | *.tar.gz 10 | 11 | datasets/ReferIt/* 12 | datasets/Kitchen/* 13 | 14 | data/* 15 | external/* 16 | 17 | *.caffemodel 18 | *.solverstate 19 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | UC Berkeley's Standard Copyright and Disclaimer Notice: 2 | 3 | Copyright (c) 2016. The Regents of the University of California (Regents). All 4 | Rights Reserved. Permission to use, copy, modify, and distribute this software 5 | and its documentation for educational, research, and not-for-profit purposes, 6 | without fee and without a signed licensing agreement, is hereby granted, 7 | provided that the above copyright notice, this paragraph and the following 8 | two paragraphs appear in all copies, modifications, and distributions. 9 | Contact The Office of Technology Licensing, UC Berkeley, 2150 Shattuck Avenue, 10 | Suite 510, Berkeley, CA 94720-1620, (510) 643-7201, for commercial licensing 11 | opportunities. 12 | 13 | Ronghang Hu, University of California, Berkeley. 14 | 15 | IN NO EVENT SHALL REGENTS BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, 16 | INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF 17 | THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF REGENTS HAS BEEN 18 | ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 19 | 20 | REGENTS SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, 21 | THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. 22 | THE SOFTWARE AND ACCOMPANYING DOCUMENTATION, IF ANY, PROVIDED HEREUNDER IS 23 | PROVIDED "AS IS". REGENTS HAS NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, 24 | UPDATES, ENHANCEMENTS, OR MODIFICATIONS. 
25 | 26 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Natural Language Object Retrieval 2 | This repository contains the code for the following paper: 3 | 4 | * R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, T. Darrell, *Natural Language Object Retrieval*, in Computer Vision and Pattern Recognition (CVPR), 2016 ([PDF](http://arxiv.org/pdf/1511.04164)) 5 | ``` 6 | @article{hu2016natural, 7 | title={Natural Language Object Retrieval}, 8 | author={Hu, Ronghang and Xu, Huazhe and Rohrbach, Marcus and Feng, Jiashi and Saenko, Kate and Darrell, Trevor}, 9 | journal={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition}, 10 | year={2016} 11 | } 12 | ``` 13 | 14 | Project Page: http://ronghanghu.com/text_obj_retrieval 15 | 16 | ## Installation 17 | 1. Download this repository or clone it with Git, and then `cd` into the root directory of the repository. 18 | 2. Run `./external/download_caffe.sh` to download the SCRC Caffe version for this experiment. It will be downloaded and unzipped into `external/caffe-natural-language-object-retrieval`. This version is modified from the [Caffe LRCN implementation](http://jeffdonahue.com/lrcn/). 19 | 3. Build the SCRC Caffe version in `external/caffe-natural-language-object-retrieval`, following the [Caffe installation instructions](http://caffe.berkeleyvision.org/installation.html). **Remember to also build pycaffe.** 20 | 21 | ## SCRC demo 22 | 1. Download the pretrained models with `./models/download_trained_models.sh`. 23 | 2. Run the SCRC demo in `./demo/retrieval_demo.ipynb` with [Jupyter Notebook (IPython Notebook)](http://ipython.org/notebook.html). A scripted version of the demo is sketched at the end of this README. 24 | 25 | ![Image](http://www.eecs.berkeley.edu/~ronghang/projects/cvpr16_text_obj_retrieval/scrc_demo.jpg) 26 | 27 | ## Train and evaluate SCRC model on ReferIt Dataset 28 | 1. Download the ReferIt dataset: `./datasets/download_referit_dataset.sh`. 29 | 2. Download pre-extracted EdgeBox proposals: `./data/download_edgebox_proposals.sh`. 30 | 3. You may need to add the SCRC root directory to Python's module path: `export PYTHONPATH=.:$PYTHONPATH`. 31 | 4. Preprocess the ReferIt dataset to generate metadata needed for training and evaluation: `python ./exp-referit/preprocess_dataset.py`. 32 | 5. Cache the scene-level contextual features to disk: `python ./exp-referit/cache_referit_context_features.py`. 33 | 6. Build training image lists and HDF5 batches: `python ./exp-referit/cache_referit_training_batches.py`. 34 | 7. Initialize the model parameters and train with SGD: `python ./exp-referit/initialize_weights_scrc_full.py && ./exp-referit/train_scrc_full_on_referit.sh`. 35 | 8. Evaluate the trained model: `python ./exp-referit/test_scrc_on_referit.py`. 36 | 37 | Optionally, you may also train an SCRC version without the contextual feature, using `python ./exp-referit/initialize_weights_scrc_no_context.py && ./exp-referit/train_scrc_no_context_on_referit.sh`. 38 | 39 | ## Train and evaluate SCRC model on Kitchen Dataset 40 | 1. Download the Kitchen dataset: `./datasets/download_kitchen_dataset.sh`. 41 | 2. You may need to add the SCRC root directory to Python's module path: `export PYTHONPATH=.:$PYTHONPATH`. 42 | 3. Build training image lists and HDF5 batches: `python exp-kitchen/cache_kitchen_training_batches.py`. 43 | 4. Train with SGD: `./exp-kitchen/train_scrc_kitchen.sh`. 44 | 5. Evaluate the trained model: `python exp-kitchen/test_scrc_on_kitchen.py`.
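## Scripted retrieval example

The demo in `./demo/retrieval_demo.ipynb` boils down to scoring a query sentence against candidate boxes, as done in `exp-referit/test_scrc_on_referit.py`. Below is a minimal sketch pieced together from that script, using the demo image and its precomputed EdgeBox proposals in `./demo/demo_data/`; the query string is illustrative, and it assumes the pretrained models have been downloaded and pycaffe built as described above:

```python
import sys
import numpy as np
import skimage.io
sys.path.append('./external/caffe-natural-language-object-retrieval/python/')
sys.path.append('./external/caffe-natural-language-object-retrieval/examples/coco_caption/')
import caffe  # pycaffe must be built first

from captioner import Captioner
import retriever

# Load the no-context SCRC model (downloaded by ./models/download_trained_models.sh)
captioner = Captioner('./models/scrc_no_context_vgg.caffemodel',
                      './prototxt/VGG_ILSVRC_16_layers_deploy.prototxt',
                      './prototxt/scrc_word_to_preds_no_context.prototxt',
                      './data/vocabulary.txt', 0)  # last argument is the GPU id
captioner.set_image_batch_size(50)
vocab_dict = retriever.build_vocab_dict_from_captioner(captioner)

# Demo image and its top-100 EdgeBox proposals shipped with the repository
im = skimage.io.imread('./demo/demo_data/40429.jpg')
candidate_boxes = np.loadtxt('./demo/demo_data/40429.txt').astype(int).reshape((-1, 4))

# Local descriptor = per-box visual feature + spatial feature
descriptors = retriever.compute_descriptors_edgebox(captioner, im, candidate_boxes)
spatial_feats = retriever.compute_spatial_feat(candidate_boxes,
                                               np.array([im.shape[1], im.shape[0]]))
descriptors = np.concatenate((descriptors, spatial_feats), axis=1)

# Score every candidate box against the query and keep the highest-scoring one
query = 'picture frame on the wall'  # an illustrative description
scores = retriever.score_descriptors(descriptors, query, captioner, vocab_dict)
print(candidate_boxes[np.argmax(scores)])  # [x_min, y_min, x_max, y_max]
```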
45 | -------------------------------------------------------------------------------- /data/download_edgebox_proposals.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | wget -O ./data/referit_edgeboxes_top100.zip http://www.eecs.berkeley.edu/~ronghang/projects/cvpr16_text_obj_retrieval/referit_edgeboxes_top100.zip 3 | unzip ./data/referit_edgeboxes_top100.zip -d ./data/ 4 | -------------------------------------------------------------------------------- /data/metadata/.keep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ronghanghu/natural-language-object-retrieval/c6ddd5d78e9d4d886abc20d4e1b4421b3795a89e/data/metadata/.keep -------------------------------------------------------------------------------- /data/split/referit_val_imlist.txt: -------------------------------------------------------------------------------- 1 | 26953 2 | 9849 3 | 27582 4 | 7599 5 | 10917 6 | 12784 7 | 6834 8 | 15896 9 | 10784 10 | 11446 11 | 8010 12 | 8900 13 | 18817 14 | 1319 15 | 10398 16 | 15594 17 | 14964 18 | 37838 19 | 8588 20 | 32036 21 | 22291 22 | 51 23 | 20277 24 | 2983 25 | 39011 26 | 9962 27 | 7336 28 | 16305 29 | 37903 30 | 13424 31 | 19975 32 | 22642 33 | 37667 34 | 37167 35 | 19456 36 | 15833 37 | 32716 38 | 9152 39 | 10334 40 | 11837 41 | 39153 42 | 2065 43 | 13066 44 | 31689 45 | 30188 46 | 31155 47 | 37681 48 | 25207 49 | 13295 50 | 30235 51 | 7369 52 | 12012 53 | 32356 54 | 31636 55 | 1957 56 | 39261 57 | 4117 58 | 31085 59 | 20953 60 | 35682 61 | 7544 62 | 8169 63 | 30938 64 | 7136 65 | 26920 66 | 3086 67 | 13854 68 | 8704 69 | 30883 70 | 35578 71 | 6764 72 | 22860 73 | 30192 74 | 17903 75 | 38048 76 | 8834 77 | 37313 78 | 6956 79 | 4993 80 | 18376 81 | 844 82 | 10111 83 | 799 84 | 9636 85 | 37220 86 | 37675 87 | 807 88 | 19128 89 | 15889 90 | 39980 91 | 31974 92 | 9370 93 | 31788 94 | 5003 95 | 38971 96 | 12601 97 | 18080 98 | 4828 99 | 35615 100 | 18329 101 | 7740 102 | 37147 103 | 5011 104 | 10285 105 | 40402 106 | 24027 107 | 39468 108 | 8209 109 | 30314 110 | 32284 111 | 14784 112 | 929 113 | 3593 114 | 25239 115 | 3908 116 | 2948 117 | 16330 118 | 37776 119 | 14840 120 | 23339 121 | 22801 122 | 30378 123 | 7929 124 | 8316 125 | 11063 126 | 40553 127 | 31068 128 | 18113 129 | 18494 130 | 37853 131 | 32540 132 | 37234 133 | 10603 134 | 8384 135 | 1842 136 | 6323 137 | 22994 138 | 30521 139 | 19114 140 | 30968 141 | 19642 142 | 8208 143 | 35863 144 | 27662 145 | 30504 146 | 15700 147 | 27689 148 | 38704 149 | 32583 150 | 2993 151 | 25377 152 | 30073 153 | 6283 154 | 9413 155 | 30234 156 | 32206 157 | 39139 158 | 21145 159 | 38926 160 | 35943 161 | 21328 162 | 14116 163 | 2658 164 | 17537 165 | 30704 166 | 37715 167 | 3269 168 | 4778 169 | 26020 170 | 19813 171 | 33478 172 | 700 173 | 33374 174 | 31806 175 | 12896 176 | 9684 177 | 31262 178 | 798 179 | 22873 180 | 4313 181 | 2389 182 | 1142 183 | 30367 184 | 22158 185 | 7331 186 | 13857 187 | 4168 188 | 58 189 | 30616 190 | 31607 191 | 18883 192 | 8596 193 | 31019 194 | 32114 195 | 16152 196 | 16342 197 | 20902 198 | 5104 199 | 37284 200 | 12778 201 | 7055 202 | 10010 203 | 9588 204 | 13516 205 | 38217 206 | 16674 207 | 15403 208 | 32577 209 | 40585 210 | 31944 211 | 22644 212 | 10974 213 | 38112 214 | 23014 215 | 39051 216 | 37218 217 | 6869 218 | 15779 219 | 1862 220 | 16886 221 | 30565 222 | 16812 223 | 39753 224 | 37286 225 | 40451 226 | 9015 227 | 10233 228 | 11117 229 | 30231 230 | 20761 231 | 37844 232 | 19565 
233 | 13637 234 | 32720 235 | 40101 236 | 19696 237 | 18460 238 | 11345 239 | 12516 240 | 37232 241 | 23730 242 | 19408 243 | 8440 244 | 7008 245 | 3015 246 | 17973 247 | 33420 248 | 4952 249 | 2804 250 | 20163 251 | 11151 252 | 27618 253 | 19698 254 | 8741 255 | 12350 256 | 26025 257 | 31572 258 | 32620 259 | 31147 260 | 40261 261 | 18677 262 | 9749 263 | 4662 264 | 7527 265 | 35841 266 | 39099 267 | 3837 268 | 10158 269 | 6545 270 | 15672 271 | 39740 272 | 8458 273 | 19442 274 | 19779 275 | 6876 276 | 40638 277 | 7161 278 | 37550 279 | 17582 280 | 10890 281 | 12925 282 | 21776 283 | 3277 284 | 7654 285 | 39675 286 | 40428 287 | 9984 288 | 30537 289 | 1149 290 | 32704 291 | 10119 292 | 11709 293 | 40054 294 | 8407 295 | 8845 296 | 10035 297 | 7066 298 | 16355 299 | 32389 300 | 811 301 | 35801 302 | 39669 303 | 21152 304 | 7721 305 | 31508 306 | 37128 307 | 32253 308 | 13017 309 | 19553 310 | 14977 311 | 39074 312 | 31409 313 | 7535 314 | 17047 315 | 59 316 | 10299 317 | 8991 318 | 30644 319 | 14757 320 | 11391 321 | 21319 322 | 32098 323 | 10631 324 | 39949 325 | 32073 326 | 9483 327 | 38690 328 | 17501 329 | 17776 330 | 1394 331 | 16222 332 | 15966 333 | 35885 334 | 2581 335 | 15316 336 | 6406 337 | 6661 338 | 12651 339 | 37868 340 | 38011 341 | 38185 342 | 30779 343 | 9490 344 | 30952 345 | 32080 346 | 10506 347 | 40399 348 | 37441 349 | 10157 350 | 10896 351 | 2386 352 | 37150 353 | 32329 354 | 6981 355 | 16098 356 | 20203 357 | 39909 358 | 38841 359 | 18941 360 | 25191 361 | 30224 362 | 39239 363 | 1003 364 | 40165 365 | 22447 366 | 22911 367 | 30366 368 | 8721 369 | 24277 370 | 14704 371 | 2889 372 | 31498 373 | 8166 374 | 12495 375 | 17026 376 | 5159 377 | 34160 378 | 17725 379 | 19630 380 | 31506 381 | 18753 382 | 14905 383 | 20983 384 | 40461 385 | 32489 386 | 25607 387 | 39971 388 | 14551 389 | 2531 390 | 7039 391 | 4965 392 | 22711 393 | 9251 394 | 10705 395 | 12741 396 | 27460 397 | 40372 398 | 3362 399 | 32203 400 | 11493 401 | 40563 402 | 20595 403 | 19295 404 | 10559 405 | 22605 406 | 10719 407 | 31296 408 | 20864 409 | 970 410 | 21956 411 | 30374 412 | 13416 413 | 30764 414 | 8639 415 | 12388 416 | 5196 417 | 16391 418 | 11401 419 | 11714 420 | 38047 421 | 36042 422 | 12764 423 | 8075 424 | 7188 425 | 11424 426 | 15084 427 | 18895 428 | 37538 429 | 7159 430 | 9969 431 | 32824 432 | 3151 433 | 21606 434 | 6872 435 | 30259 436 | 9528 437 | 38115 438 | 40640 439 | 7581 440 | 7052 441 | 31381 442 | 30552 443 | 31005 444 | 13552 445 | 4642 446 | 9343 447 | 22189 448 | 5160 449 | 17146 450 | 32813 451 | 11582 452 | 32286 453 | 2145 454 | 9477 455 | 40300 456 | 20769 457 | 13186 458 | 2541 459 | 22182 460 | 31769 461 | 7140 462 | 31421 463 | 2485 464 | 27686 465 | 25236 466 | 9498 467 | 40156 468 | 11782 469 | 8566 470 | 12815 471 | 31412 472 | 8978 473 | 32108 474 | 38888 475 | 27679 476 | 35662 477 | 3071 478 | 3066 479 | 10835 480 | 31289 481 | 21941 482 | 1764 483 | 31964 484 | 39216 485 | 24982 486 | 14049 487 | 10421 488 | 9647 489 | 13431 490 | 22997 491 | 32420 492 | 32045 493 | 19772 494 | 11399 495 | 15353 496 | 6598 497 | 23940 498 | 23177 499 | 18167 500 | 19112 501 | 21855 502 | 6281 503 | 894 504 | 4805 505 | 797 506 | 39630 507 | 21180 508 | 6802 509 | 40327 510 | 18043 511 | 15429 512 | 1720 513 | 9027 514 | 3103 515 | 11074 516 | 20664 517 | 39607 518 | 3248 519 | 24914 520 | 8359 521 | 24490 522 | 20481 523 | 35939 524 | 13495 525 | 2794 526 | 37621 527 | 31948 528 | 19039 529 | 8303 530 | 14472 531 | 1234 532 | 25968 533 | 17850 534 | 39764 535 | 9739 536 | 
15541 537 | 6527 538 | 13123 539 | 20259 540 | 7736 541 | 31063 542 | 37723 543 | 31451 544 | 31545 545 | 18287 546 | 31276 547 | 8686 548 | 20141 549 | 37971 550 | 10220 551 | 13451 552 | 17393 553 | 38719 554 | 10550 555 | 21607 556 | 21219 557 | 40121 558 | 27384 559 | 12151 560 | 11685 561 | 15157 562 | 12103 563 | 10843 564 | 32498 565 | 30238 566 | 39976 567 | 33364 568 | 26725 569 | 7986 570 | 9614 571 | 35737 572 | 10107 573 | 31354 574 | 23463 575 | 26277 576 | 17723 577 | 30228 578 | 8344 579 | 922 580 | 26141 581 | 2275 582 | 32461 583 | 10444 584 | 13257 585 | 32722 586 | 10004 587 | 22133 588 | 40362 589 | 14475 590 | 39746 591 | 11950 592 | 3811 593 | 26325 594 | 21637 595 | 11218 596 | 10919 597 | 35964 598 | 19650 599 | 32218 600 | 35665 601 | 38958 602 | 8121 603 | 27006 604 | 9354 605 | 38929 606 | 8680 607 | 18869 608 | 30689 609 | 8217 610 | 40292 611 | 31396 612 | 7353 613 | 2119 614 | 9186 615 | 11112 616 | 865 617 | 18934 618 | 10756 619 | 13328 620 | 1030 621 | 3339 622 | 39027 623 | 39096 624 | 39142 625 | 685 626 | 1354 627 | 20024 628 | 3984 629 | 12519 630 | 18574 631 | 20696 632 | 22586 633 | 9218 634 | 30740 635 | 24258 636 | 23166 637 | 8514 638 | 15695 639 | 31090 640 | 14579 641 | 7007 642 | 8464 643 | 37369 644 | 39368 645 | 25781 646 | 19748 647 | 10689 648 | 39706 649 | 2232 650 | 10911 651 | 37085 652 | 22570 653 | 9159 654 | 7738 655 | 40060 656 | 8092 657 | 26428 658 | 5075 659 | 3992 660 | 19403 661 | 40379 662 | 22334 663 | 19174 664 | 11684 665 | 11334 666 | 26972 667 | 38134 668 | 17761 669 | 37850 670 | 10374 671 | 4953 672 | 37980 673 | 6461 674 | 22093 675 | 37917 676 | 4212 677 | 7678 678 | 9121 679 | 12169 680 | 2100 681 | 14474 682 | 32424 683 | 3659 684 | 14036 685 | 13255 686 | 9850 687 | 7253 688 | 4265 689 | 6593 690 | 30853 691 | 24670 692 | 37390 693 | 27604 694 | 19334 695 | 23438 696 | 1159 697 | 37149 698 | 37629 699 | 30147 700 | 22234 701 | 31685 702 | 27296 703 | 38927 704 | 3786 705 | 23123 706 | 15300 707 | 25504 708 | 22762 709 | 27542 710 | 17997 711 | 8685 712 | 20102 713 | 19757 714 | 30871 715 | 10425 716 | 20799 717 | 30777 718 | 18208 719 | 30700 720 | 22936 721 | 14883 722 | 2701 723 | 19744 724 | 8191 725 | 19001 726 | 19664 727 | 19239 728 | 10217 729 | 15307 730 | 14978 731 | 7940 732 | 37070 733 | 10127 734 | 14590 735 | 30805 736 | 39726 737 | 30985 738 | 1844 739 | 21628 740 | 27238 741 | 8944 742 | 8002 743 | 38918 744 | 31024 745 | 27506 746 | 31366 747 | 2979 748 | 24320 749 | 31916 750 | 4736 751 | 15870 752 | 6553 753 | 7074 754 | 37056 755 | 1348 756 | 8463 757 | 4722 758 | 19461 759 | 943 760 | 11014 761 | 6542 762 | 30882 763 | 31860 764 | 40557 765 | 39893 766 | 17646 767 | 19381 768 | 11602 769 | 24979 770 | 7856 771 | 9702 772 | 37303 773 | 32188 774 | 9902 775 | 19303 776 | 22463 777 | 1375 778 | 19071 779 | 10839 780 | 8893 781 | 35945 782 | 9475 783 | 1835 784 | 19395 785 | 39422 786 | 31311 787 | 8001 788 | 27044 789 | 6299 790 | 6255 791 | 14112 792 | 16839 793 | 24492 794 | 37946 795 | 14171 796 | 30951 797 | 39623 798 | 13311 799 | 32493 800 | 22682 801 | 30852 802 | 26281 803 | 35788 804 | 12405 805 | 2308 806 | 40403 807 | 18269 808 | 8822 809 | 3249 810 | 2628 811 | 6474 812 | 15404 813 | 10054 814 | 13026 815 | 27482 816 | 16372 817 | 35935 818 | 4858 819 | 14441 820 | 26337 821 | 21311 822 | 31113 823 | 11046 824 | 40660 825 | 7481 826 | 37896 827 | 7123 828 | 7037 829 | 37536 830 | 10716 831 | 10324 832 | 37458 833 | 37320 834 | 20507 835 | 3345 836 | 31468 837 | 31475 838 | 12244 839 | 
10929 840 | 17279 841 | 39145 842 | 23250 843 | 12986 844 | 23860 845 | 32784 846 | 16046 847 | 4872 848 | 2441 849 | 30309 850 | 8742 851 | 16078 852 | 12919 853 | 19409 854 | 6322 855 | 24677 856 | 22908 857 | 1343 858 | 16759 859 | 21823 860 | 3618 861 | 22645 862 | 2165 863 | 27633 864 | 40267 865 | 17758 866 | 11076 867 | 14528 868 | 32891 869 | 9606 870 | 37524 871 | 11575 872 | 7674 873 | 25227 874 | 21608 875 | 8160 876 | 31867 877 | 21833 878 | 26497 879 | 16314 880 | 13881 881 | 19952 882 | 30967 883 | 37640 884 | 11145 885 | 30773 886 | 8604 887 | 37172 888 | 7346 889 | 22861 890 | 11071 891 | 15564 892 | 33426 893 | 37717 894 | 20757 895 | 7456 896 | 15679 897 | 16822 898 | 17035 899 | 35579 900 | 17262 901 | 32445 902 | 3856 903 | 3585 904 | 2805 905 | 32805 906 | 15441 907 | 19756 908 | 31188 909 | 10353 910 | 18843 911 | 30424 912 | 7989 913 | 22546 914 | 4005 915 | 34135 916 | 8433 917 | 37806 918 | 13715 919 | 30729 920 | 10009 921 | 18135 922 | 31470 923 | 38098 924 | 3081 925 | 30774 926 | 7082 927 | 30187 928 | 5123 929 | 26336 930 | 2508 931 | 934 932 | 8775 933 | 30866 934 | 17963 935 | 3907 936 | 2194 937 | 22453 938 | 4006 939 | 10720 940 | 31436 941 | 23015 942 | 7569 943 | 30353 944 | 21648 945 | 2934 946 | 33525 947 | 13727 948 | 31758 949 | 15342 950 | 8054 951 | 30793 952 | 32264 953 | 19354 954 | 18036 955 | 2588 956 | 3862 957 | 32453 958 | 39415 959 | 3155 960 | 8591 961 | 18145 962 | 9237 963 | 16762 964 | 10619 965 | 37845 966 | 19564 967 | 18492 968 | 31656 969 | 37573 970 | 19736 971 | 9516 972 | 13511 973 | 18944 974 | 14557 975 | 26805 976 | 15379 977 | 4964 978 | 17187 979 | 17607 980 | 7330 981 | 13856 982 | 6962 983 | 15452 984 | 23186 985 | 39494 986 | 8272 987 | 30811 988 | 19217 989 | 10620 990 | 27619 991 | 13261 992 | 35643 993 | 9074 994 | 14485 995 | 5064 996 | 16854 997 | 18730 998 | 38034 999 | 25552 1000 | 25628 1001 | -------------------------------------------------------------------------------- /data/training/.keep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ronghanghu/natural-language-object-retrieval/c6ddd5d78e9d4d886abc20d4e1b4421b3795a89e/data/training/.keep -------------------------------------------------------------------------------- /datasets/ReferIt/ImageCLEF/.keep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ronghanghu/natural-language-object-retrieval/c6ddd5d78e9d4d886abc20d4e1b4421b3795a89e/datasets/ReferIt/ImageCLEF/.keep -------------------------------------------------------------------------------- /datasets/ReferIt/ReferitData/.keep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ronghanghu/natural-language-object-retrieval/c6ddd5d78e9d4d886abc20d4e1b4421b3795a89e/datasets/ReferIt/ReferitData/.keep -------------------------------------------------------------------------------- /datasets/download_kitchen_dataset.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | wget -O ./datasets/Kitchen.tar.gz http://www.eecs.berkeley.edu/~ronghang/projects/cvpr16_text_obj_retrieval/Kitchen.tar.gz 3 | tar -xzvf ./datasets/Kitchen.tar.gz -C ./datasets/ 4 | cp ./datasets/Kitchen/split/*.txt ./data/split/ 5 | cp ./datasets/Kitchen/annotation/*.json ./data/metadata/ 6 | -------------------------------------------------------------------------------- 
/datasets/download_referit_dataset.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | wget -O ./datasets/ReferIt/ReferitData/ReferitData.zip http://tamaraberg.com/referitgame/ReferitData.zip 3 | unzip ./datasets/ReferIt/ReferitData/ReferitData.zip -d ./datasets/ReferIt/ReferitData/ 4 | wget -O ./datasets/ReferIt/ImageCLEF/referitdata.tar.gz http://www.eecs.berkeley.edu/~ronghang/projects/cvpr16_text_obj_retrieval/referitdata.tar.gz 5 | tar -xzvf ./datasets/ReferIt/ImageCLEF/referitdata.tar.gz -C ./datasets/ReferIt/ImageCLEF/ 6 | -------------------------------------------------------------------------------- /demo/demo_data/40429.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ronghanghu/natural-language-object-retrieval/c6ddd5d78e9d4d886abc20d4e1b4421b3795a89e/demo/demo_data/40429.jpg -------------------------------------------------------------------------------- /demo/demo_data/40429.txt: -------------------------------------------------------------------------------- 1 | 1.2600000e+02 2.3400000e+02 2.0000000e+02 3.0500000e+02 2 | 6.9000000e+01 6.9000000e+01 2.5900000e+02 3.3900000e+02 3 | 7.3000000e+01 7.0000000e+01 3.0300000e+02 3.2000000e+02 4 | 0.0000000e+00 6.6000000e+01 4.7800000e+02 3.5800000e+02 5 | 1.9900000e+02 6.0000000e+01 4.7800000e+02 3.1400000e+02 6 | 9.6000000e+01 6.0000000e+01 4.7800000e+02 3.2800000e+02 7 | 7.5000000e+01 2.2000000e+01 2.6000000e+02 3.0600000e+02 8 | 0.0000000e+00 4.8000000e+01 2.8200000e+02 3.2600000e+02 9 | 1.1300000e+02 4.8000000e+01 2.8700000e+02 3.1400000e+02 10 | 1.1400000e+02 1.8000000e+02 2.1400000e+02 3.2500000e+02 11 | 8.1000000e+01 2.2000000e+01 4.1800000e+02 3.2800000e+02 12 | 1.1800000e+02 2.2300000e+02 2.0100000e+02 3.1900000e+02 13 | 9.3000000e+01 1.5400000e+02 2.9600000e+02 3.4000000e+02 14 | 6.3000000e+01 7.4000000e+01 2.1900000e+02 3.2400000e+02 15 | 1.1400000e+02 2.3400000e+02 2.1500000e+02 3.0400000e+02 16 | 1.1100000e+02 5.8000000e+01 3.8400000e+02 3.2800000e+02 17 | 1.5700000e+02 6.0000000e+01 4.3900000e+02 3.2800000e+02 18 | 0.0000000e+00 7.5000000e+01 2.1800000e+02 3.4400000e+02 19 | 0.0000000e+00 1.4600000e+02 4.7800000e+02 3.5800000e+02 20 | 5.3000000e+01 7.0000000e+01 3.0900000e+02 2.7100000e+02 21 | 3.3000000e+01 4.5000000e+01 2.6000000e+02 2.6900000e+02 22 | 6.9000000e+01 7.4000000e+01 2.5900000e+02 2.6300000e+02 23 | 1.1200000e+02 8.1000000e+01 2.1800000e+02 3.0400000e+02 24 | 2.0400000e+02 7.0000000e+01 4.2100000e+02 2.7300000e+02 25 | 0.0000000e+00 1.5000000e+02 2.1800000e+02 3.2600000e+02 26 | 2.0600000e+02 1.1300000e+02 4.7800000e+02 3.0700000e+02 27 | 7.9000000e+01 1.2800000e+02 2.1400000e+02 3.2600000e+02 28 | 3.6000000e+01 7.5000000e+01 3.6400000e+02 3.1300000e+02 29 | 8.0000000e+01 1.1300000e+02 4.7800000e+02 3.3900000e+02 30 | 9.9000000e+01 1.8000000e+02 2.1900000e+02 3.0600000e+02 31 | 3.7000000e+01 1.5400000e+02 2.1300000e+02 3.4000000e+02 32 | 1.0800000e+02 8.3000000e+01 2.3100000e+02 3.3900000e+02 33 | 1.6200000e+02 2.3000000e+01 3.9300000e+02 3.3900000e+02 34 | 9.8000000e+01 1.8000000e+02 2.7600000e+02 3.2800000e+02 35 | 0.0000000e+00 2.4000000e+01 4.7800000e+02 2.8800000e+02 36 | 0.0000000e+00 4.6000000e+01 2.1800000e+02 2.9800000e+02 37 | 1.1100000e+02 1.2900000e+02 2.1300000e+02 3.2500000e+02 38 | 1.5600000e+02 3.1000000e+01 4.7800000e+02 2.8900000e+02 39 | 1.2000000e+02 7.4000000e+01 2.5900000e+02 3.4200000e+02 40 | 3.7000000e+01 1.2000000e+02 2.1300000e+02 3.1800000e+02 41 | 
1.1500000e+02 2.3400000e+02 2.0900000e+02 3.4000000e+02 42 | 0.0000000e+00 1.0800000e+02 2.7800000e+02 3.5200000e+02 43 | 0.0000000e+00 4.7000000e+01 3.9400000e+02 2.9800000e+02 44 | 1.9300000e+02 7.0000000e+01 3.0500000e+02 2.7100000e+02 45 | 7.1000000e+01 1.8000000e+02 2.0100000e+02 3.2100000e+02 46 | 9.6000000e+01 4.8000000e+01 2.4200000e+02 3.2600000e+02 47 | 8.2000000e+01 1.2700000e+02 3.0000000e+02 3.0400000e+02 48 | 9.5000000e+01 2.2300000e+02 1.9500000e+02 3.1900000e+02 49 | 1.1100000e+02 1.5300000e+02 4.3900000e+02 3.4000000e+02 50 | 9.1000000e+01 1.9600000e+02 1.9500000e+02 3.0500000e+02 51 | 2.0400000e+02 7.5000000e+01 2.6000000e+02 2.6300000e+02 52 | 6.8000000e+01 1.5300000e+02 2.1300000e+02 3.0900000e+02 53 | 0.0000000e+00 1.7900000e+02 2.1800000e+02 3.5200000e+02 54 | 8.7000000e+01 1.7800000e+02 2.9900000e+02 3.1000000e+02 55 | 1.2400000e+02 1.8600000e+02 2.1300000e+02 3.0500000e+02 56 | 3.6000000e+01 1.0200000e+02 3.1100000e+02 3.3600000e+02 57 | 9.3000000e+01 1.5500000e+02 2.0600000e+02 3.2000000e+02 58 | 9.8000000e+01 1.3000000e+02 2.5400000e+02 3.4100000e+02 59 | 4.6000000e+01 7.4000000e+01 2.1800000e+02 2.5500000e+02 60 | 6.9000000e+01 1.3000000e+02 3.6400000e+02 3.4500000e+02 61 | 7.2000000e+01 2.2200000e+02 2.0000000e+02 3.0400000e+02 62 | 0.0000000e+00 1.5400000e+02 3.0600000e+02 3.2400000e+02 63 | 1.6400000e+02 6.9000000e+01 2.7200000e+02 3.0100000e+02 64 | 2.4300000e+02 5.7000000e+01 4.4000000e+02 3.2900000e+02 65 | 2.9000000e+01 1.5300000e+02 2.5900000e+02 3.1900000e+02 66 | 1.2400000e+02 1.5100000e+02 2.1400000e+02 3.4000000e+02 67 | 0.0000000e+00 1.1300000e+02 3.8400000e+02 3.3400000e+02 68 | 1.9900000e+02 5.9000000e+01 3.8000000e+02 3.1300000e+02 69 | 2.3200000e+02 1.5300000e+02 4.7800000e+02 3.2800000e+02 70 | 7.3000000e+01 4.7000000e+01 4.7800000e+02 2.6900000e+02 71 | 1.8900000e+02 5.5000000e+01 3.0400000e+02 3.1400000e+02 72 | 2.8000000e+01 1.9100000e+02 2.1400000e+02 3.2400000e+02 73 | 1.1300000e+02 1.7900000e+02 3.6400000e+02 3.4100000e+02 74 | 1.2000000e+02 2.3400000e+02 1.9500000e+02 2.8600000e+02 75 | 1.1400000e+02 1.1100000e+02 3.9200000e+02 3.3900000e+02 76 | 5.6000000e+01 2.2300000e+02 2.0200000e+02 3.1900000e+02 77 | 2.4300000e+02 1.5300000e+02 4.7800000e+02 2.8800000e+02 78 | 1.8000000e+02 3.2000000e+01 3.8500000e+02 2.7700000e+02 79 | 1.3300000e+02 9.7000000e+01 4.7800000e+02 3.0400000e+02 80 | 1.9700000e+02 6.0000000e+01 2.8700000e+02 3.0100000e+02 81 | 1.1300000e+02 1.3000000e+02 4.1900000e+02 3.1300000e+02 82 | 1.5300000e+02 6.0000000e+01 3.0500000e+02 3.0100000e+02 83 | 1.2000000e+02 1.5200000e+02 2.6000000e+02 3.4100000e+02 84 | 1.1300000e+02 1.9100000e+02 4.3100000e+02 3.2900000e+02 85 | 2.5000000e+02 6.0000000e+01 4.6200000e+02 2.7300000e+02 86 | 9.3000000e+01 2.1900000e+02 2.9800000e+02 3.1900000e+02 87 | 1.2300000e+02 1.9600000e+02 2.0000000e+02 3.1300000e+02 88 | 1.1400000e+02 2.2400000e+02 2.9800000e+02 3.4100000e+02 89 | 1.5800000e+02 3.2000000e+01 2.8200000e+02 2.8800000e+02 90 | 1.8900000e+02 1.1300000e+02 3.9200000e+02 3.3000000e+02 91 | 0.0000000e+00 2.2200000e+02 2.1200000e+02 3.5200000e+02 92 | 1.8300000e+02 2.4000000e+01 3.2800000e+02 3.0100000e+02 93 | 2.5000000e+02 1.9100000e+02 3.6600000e+02 3.2900000e+02 94 | 1.1500000e+02 2.5400000e+02 2.0100000e+02 3.0400000e+02 95 | 2.0400000e+02 5.9000000e+01 4.7800000e+02 2.3700000e+02 96 | 2.0600000e+02 1.1300000e+02 2.5600000e+02 2.6200000e+02 97 | 1.2000000e+02 2.0900000e+02 2.1500000e+02 3.1300000e+02 98 | 0.0000000e+00 6.6000000e+01 2.1800000e+02 2.4900000e+02 99 | 0.0000000e+00 
1.5200000e+02 1.3800000e+02 3.5300000e+02 100 | 1.3800000e+02 7.2000000e+01 2.6000000e+02 2.6900000e+02 101 | -------------------------------------------------------------------------------- /exp-kitchen/cache_kitchen_training_batches.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function, division 2 | 3 | import os 4 | import numpy as np 5 | 6 | import util 7 | import retriever 8 | 9 | trn_imlist_file = './data/split/kitchen_trainval_imlist.txt' 10 | 11 | image_dir = './datasets/Kitchen/images/Kitchen/' 12 | query_file = './data/metadata/kitchen_query_dict.json' 13 | vocab_file = './data/vocabulary.txt' 14 | 15 | N_batch = 50 # batch size during training 16 | T = 20 # unroll timestep of LSTM 17 | 18 | save_image_list_file = './data/kitchen_train_image_list.txt' 19 | save_hdf5_list_file = './data/kitchen_train_hdf5_list.txt' 20 | save_hdf5_dir = './data/kitchen_hdf5_50/' 21 | 22 | imset = set(util.io.load_str_list(trn_imlist_file)) 23 | vocab_dict = retriever.build_vocab_dict_from_file(vocab_file) 24 | query_dict = util.io.load_json(query_file) 25 | 26 | train_pairs = [] 27 | for imname, des in query_dict.iteritems(): 28 | if imname not in imset: 29 | continue 30 | train_pairs += [(imname, d) for d in des] 31 | 32 | # random shuffle training pairs 33 | np.random.seed(3) 34 | perm_idx = np.random.permutation(np.arange(len(train_pairs))) 35 | train_pairs = [train_pairs[n] for n in perm_idx] 36 | 37 | num_train_pairs = len(train_pairs) 38 | num_train_pairs = num_train_pairs - num_train_pairs % N_batch 39 | train_pairs = train_pairs[:num_train_pairs] 40 | num_batch = int(num_train_pairs // N_batch) 41 | 42 | image_list = [] 43 | hdf5_list = [] 44 | 45 | # generate hdf5 files 46 | if not os.path.isdir(save_hdf5_dir): 47 | os.mkdir(save_hdf5_dir) 48 | for n_batch in range(num_batch): 49 | if (n_batch+1) % 10 == 0: 50 | print('writing batch %d / %d' % (n_batch+1, num_batch)) 51 | begin = n_batch * N_batch 52 | end = (n_batch + 1) * N_batch 53 | cont_sentences = np.zeros([T, N_batch], dtype=np.float32) 54 | input_sentences = np.zeros([T, N_batch], dtype=np.float32) 55 | target_sentences = np.zeros([T, N_batch], dtype=np.float32) 56 | for n_pair in range(begin, end): 57 | # Append 0 as dummy label 58 | image_path = image_dir + train_pairs[n_pair][0] + '.JPEG 0' # 0 as dummy label 59 | image_list.append(image_path) 60 | 61 | stream = retriever.sentence2vocab_indices(train_pairs[n_pair][1], vocab_dict) 62 | if len(stream) > T-1: 63 | stream = stream[:T-1] 64 | pad = T - 1 - len(stream) 65 | cont_sentences[:, n_pair-begin] = [0] + [1] * len(stream) + [0] * pad 66 | input_sentences[:, n_pair-begin] = [0] + stream + [-1] * pad 67 | target_sentences[:, n_pair-begin] = stream + [0] + [-1] * pad 68 | h5_filename = save_hdf5_dir + '%d_to_%d.h5' % (begin, end) 69 | retriever.write_batch_to_hdf5(h5_filename, cont_sentences, input_sentences, target_sentences) 70 | hdf5_list.append(h5_filename) 71 | 72 | util.io.save_str_list(image_list, save_image_list_file) 73 | util.io.save_str_list(hdf5_list, save_hdf5_list_file) 74 | -------------------------------------------------------------------------------- /exp-kitchen/caffemodel/.keep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ronghanghu/natural-language-object-retrieval/c6ddd5d78e9d4d886abc20d4e1b4421b3795a89e/exp-kitchen/caffemodel/.keep -------------------------------------------------------------------------------- 
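The caching script above packs each (image, description) pair into three T-by-N_batch arrays in the LRCN-style HDF5 layout: `cont_sentences` is 0 at the first timestep of a sequence and 1 while it continues, `input_sentences` is the word-index stream shifted right by one (with index 0 serving as the begin/end-of-sentence token), and `target_sentences` is the stream followed by 0, with -1 filling the padded tail so those timesteps are ignored by the loss. A minimal sketch of one batch column, with a hypothetical 3-word description and T = 8 (the scripts use T = 20):

```python
import numpy as np

T = 8                      # unroll length (illustration only; the scripts use T = 20)
stream = [4, 17, 9]        # hypothetical vocabulary indices of a 3-word description
pad = T - 1 - len(stream)

cont   = [0] + [1] * len(stream) + [0] * pad   # sequence-continuation gate
inputs = [0] + stream + [-1] * pad             # right-shifted words, 0 = begin-of-sentence
target = stream + [0] + [-1] * pad             # next-word labels, 0 = end, -1 = ignore

print(np.array([cont, inputs, target]))
# [[ 0  1  1  1  0  0  0  0]
#  [ 0  4 17  9 -1 -1 -1 -1]
#  [ 4 17  9  0 -1 -1 -1 -1]]
```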
/exp-kitchen/test_scrc_on_kitchen.py: -------------------------------------------------------------------------------- 1 | from __future__ import division, print_function 2 | 3 | import sys 4 | import numpy as np 5 | import skimage.io 6 | sys.path.append('./external/caffe-natural-language-object-retrieval/python/') 7 | sys.path.append('./external/caffe-natural-language-object-retrieval/examples/coco_caption/') 8 | import caffe 9 | 10 | import util 11 | from captioner import Captioner 12 | import retriever 13 | 14 | ################################################################################ 15 | # Test Parameters 16 | 17 | # distractor_set can be either "kitchen" or "imagenet" 18 | # For the "kitchen" experiment, the distractors are sampled from the test set itself 19 | # For the "imagenet" experiment, the distractors are sampled from ImageNet distractor images 20 | distractor_set = "kitchen" 21 | # Number of distractors sampled for each object 22 | distractor_per_object = 10 23 | 24 | pretrained_weights_path = './models/scrc_kitchen.caffemodel' 25 | 26 | gpu_id = 0 # the GPU to test the SCRC model 27 | 28 | tst_imlist_file = './data/split/kitchen_test_imlist.txt' 29 | ################################################################################ 30 | 31 | image_dir = './datasets/Kitchen/images/Kitchen/' 32 | 33 | if distractor_set == "kitchen": 34 | distractor_dir = image_dir 35 | distractor_imlist_file = tst_imlist_file 36 | else: 37 | distractor_dir = './datasets/Kitchen/images/ImageNET/' 38 | distractor_imlist_file = './data/split/kitchen_imagenet_imlist.txt' 39 | 40 | query_file = './data/metadata/kitchen_query_dict.json' 41 | vocab_file = './data/vocabulary.txt' 42 | 43 | # utilize the captioner module from LRCN 44 | lstm_net_proto = './prototxt/scrc_word_to_preds_no_spatial_no_context.prototxt' 45 | image_net_proto = './prototxt/VGG_ILSVRC_16_layers_deploy.prototxt' 46 | captioner = Captioner(pretrained_weights_path, image_net_proto, lstm_net_proto, 47 | vocab_file, gpu_id) 48 | captioner.set_image_batch_size(50) 49 | vocab_dict = retriever.build_vocab_dict_from_captioner(captioner) 50 | 51 | # Load image and caption list 52 | imlist = util.io.load_str_list(tst_imlist_file) 53 | num_im = len(imlist) 54 | query_dict = util.io.load_json(query_file) 55 | 56 | # Load distractors 57 | distractor_list = util.io.load_str_list(distractor_imlist_file) 58 | num_distractors = len(distractor_list) 59 | 60 | # Sample distractor images for each test image 61 | distractor_ids_per_im = {} 62 | np.random.seed(3) # fix random seed for test repeatability 63 | for imname in imlist: 64 | # Sample distractor_per_object*2 distractors to make sure the test image 65 | # itself is not among the distractors 66 | distractor_ids = np.random.choice(num_distractors, 67 | distractor_per_object*2, replace=False) 68 | distractor_names = [distractor_list[n] for n in distractor_ids[:distractor_per_object]] 69 | # Use the second half if the imname is among the first half 70 | if imname not in distractor_names: 71 | distractor_ids_per_im[imname] = distractor_ids[:distractor_per_object] 72 | else: 73 | distractor_ids_per_im[imname] = distractor_ids[distractor_per_object:] 74 | 75 | # Compute descriptors for both object images and distractor images 76 | image_path_list = [image_dir+imname+'.JPEG' for imname in imlist] 77 | distractor_path_list = [distractor_dir+imname+'.JPEG' for imname in distractor_list] 78 | 79 | obj_descriptors = captioner.compute_descriptors(image_path_list) 80 | dis_descriptors =
captioner.compute_descriptors(distractor_path_list) 81 | 82 | ################################################################################ 83 | # Test top-1 precision 84 | correct_num = 0 85 | total_num = 0 86 | for n_im in range(num_im): 87 | print('testing image %d / %d' % (n_im, num_im)) 88 | imname = imlist[n_im] 89 | for sentence in query_dict[imname]: 90 | # compute test image (target object) score given the description sentence 91 | obj_score = retriever.score_descriptors(obj_descriptors[n_im:n_im+1, :], 92 | sentence, captioner, vocab_dict)[0] 93 | # compute distractor scores given the description sentence 94 | dis_idx = distractor_ids_per_im[imname] 95 | dis_scores = retriever.score_descriptors(dis_descriptors[dis_idx, :], 96 | sentence, captioner, vocab_dict) 97 | 98 | # for a retrieval to be correct, the object image must score higher than 99 | # all distractor images 100 | correct_num += np.all(obj_score > dis_scores) 101 | total_num += 1 102 | 103 | print('Top-1 precision on the whole test set: %f' % (correct_num/total_num)) 104 | ################################################################################ 105 | -------------------------------------------------------------------------------- /exp-kitchen/train_scrc_kitchen.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | GPU_ID=0 3 | WEIGHTS=./models/coco_pretrained_iter_100000.caffemodel 4 | 5 | caffe train \ 6 | -solver ./prototxt/scrc_kitchen_solver.prototxt \ 7 | -weights $WEIGHTS \ 8 | -gpu $GPU_ID 2>&1 9 | -------------------------------------------------------------------------------- /exp-referit/cache_referit_context_features.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function, division 2 | 3 | import sys 4 | import os 5 | import numpy as np 6 | import skimage.io 7 | sys.path.append('./external/caffe-natural-language-object-retrieval/python/') 8 | sys.path.append('./external/caffe-natural-language-object-retrieval/examples/coco_caption/') 9 | import caffe 10 | 11 | import util 12 | from captioner import Captioner 13 | 14 | 15 | vgg_weights_path = './models/VGG_ILSVRC_16_layers.caffemodel' 16 | gpu_id = 0 17 | 18 | image_dir = './datasets/ReferIt/ImageCLEF/images/' 19 | cached_context_features_dir = './data/referit_context_features/' 20 | 21 | 22 | image_net_proto = './prototxt/VGG_ILSVRC_16_layers_deploy.prototxt' 23 | lstm_net_proto = './prototxt/scrc_word_to_preds_full.prototxt' 24 | vocab_file = './data/vocabulary.txt' 25 | 26 | captioner = Captioner(vgg_weights_path, image_net_proto, lstm_net_proto, vocab_file, gpu_id) 27 | batch_size = 100 28 | captioner.set_image_batch_size(batch_size) 29 | 30 | imlist = util.io.load_str_list('./data/split/referit_all_imlist.txt') 31 | num_im = len(imlist) 32 | 33 | # Load all images into memory 34 | loaded_images = [] 35 | for n_im in range(num_im): 36 | if n_im % 200 == 0: 37 | print('loading image %d / %d into memory' % (n_im, num_im)) 38 | 39 | im = skimage.io.imread(image_dir + imlist[n_im] + '.jpg') 40 | # Gray scale to RGB 41 | if im.ndim == 2: 42 | im = np.tile(im[..., np.newaxis], (1, 1, 3)) 43 | # RGBA to RGB 44 | im = im[:, :, :3] 45 | loaded_images.append(im) 46 | 47 | # Compute fc7 feature from loaded images, as whole image contextual feature 48 | descriptors = captioner.compute_descriptors(loaded_images, output_name='fc7') 49 | 50 | # Save computed contextual features 51 | if not os.path.isdir(cached_context_features_dir): 52 | 
os.mkdir(cached_context_features_dir) 53 | for n_im in range(num_im): 54 | if n_im % 200 == 0: 55 | print('saving contextual features %d / %d' % (n_im, num_im)) 56 | save_path = cached_context_features_dir + imlist[n_im] + '_fc7.npy' 57 | np.save(save_path, descriptors[n_im, :]) 58 | -------------------------------------------------------------------------------- /exp-referit/cache_referit_training_batches.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function, division 2 | 3 | import os 4 | import numpy as np 5 | 6 | import util 7 | import retriever 8 | 9 | trn_imlist_file = './data/split/referit_trainval_imlist.txt' 10 | 11 | image_dir = './datasets/ReferIt/ImageCLEF/images/' 12 | resized_imcrop_dir = './data/resized_imcrop/' 13 | cached_context_features_dir = './data/referit_context_features/' 14 | 15 | imcrop_dict_file = './data/metadata/referit_imcrop_dict.json' 16 | imcrop_bbox_dict_file = './data/metadata/referit_imcrop_bbox_dict.json' 17 | imsize_dict_file = './data/metadata/referit_imsize_dict.json' 18 | query_file = './data/metadata/referit_query_dict.json' 19 | vocab_file = './data/vocabulary.txt' 20 | 21 | N_batch = 50 # batch size during training 22 | T = 20 # unroll timestep of LSTM 23 | 24 | save_imcrop_list_file = './data/training/train_bbox_context_imcrop_list.txt' 25 | save_wholeim_list_file = './data/training/train_bbox_context_wholeim_list.txt' 26 | save_hdf5_text_list_file = './data/training/train_bbox_context_hdf5_text_list.txt' 27 | save_hdf5_bbox_list_file = './data/training/train_bbox_context_hdf5_bbox_list.txt' 28 | save_hdf5_dir = './data/training/hdf5_50_bbox_context/' 29 | 30 | imset = set(util.io.load_str_list(trn_imlist_file)) 31 | vocab_dict = retriever.build_vocab_dict_from_file(vocab_file) 32 | query_dict = util.io.load_json(query_file) 33 | imsize_dict = util.io.load_json(imsize_dict_file) 34 | imcrop_bbox_dict = util.io.load_json(imcrop_bbox_dict_file) 35 | 36 | train_pairs = [] 37 | for imcrop_name, des in query_dict.iteritems(): 38 | imname = imcrop_name.split('_', 1)[0] 39 | if imname not in imset: 40 | continue 41 | imsize = np.array(imsize_dict[imname]) 42 | bbox = np.array(imcrop_bbox_dict[imcrop_name]) 43 | bbox_feat = retriever.compute_spatial_feat(bbox, imsize) 44 | context_feature = np.load(cached_context_features_dir + imname + '_fc7.npy') 45 | train_pairs += [(imcrop_name, d, bbox_feat, imname, context_feature) for d in des] 46 | 47 | # random shuffle training pairs 48 | np.random.seed(3) 49 | perm_idx = np.random.permutation(np.arange(len(train_pairs))) 50 | train_pairs = [train_pairs[n] for n in perm_idx] 51 | 52 | num_train_pairs = len(train_pairs) 53 | num_train_pairs = num_train_pairs - num_train_pairs % N_batch 54 | train_pairs = train_pairs[:num_train_pairs] 55 | num_batch = int(num_train_pairs // N_batch) 56 | 57 | imcrop_list = [] 58 | wholeim_list = [] 59 | hdf5_text_list = [] 60 | hdf5_bbox_list = [] 61 | 62 | # generate hdf5 files 63 | if not os.path.isdir(save_hdf5_dir): 64 | os.mkdir(save_hdf5_dir) 65 | for n_batch in range(num_batch): 66 | if (n_batch+1) % 100 == 0: 67 | print('writing batch %d / %d' % (n_batch+1, num_batch)) 68 | begin = n_batch * N_batch 69 | end = (n_batch + 1) * N_batch 70 | cont_sentences = np.zeros([T, N_batch], dtype=np.float32) 71 | input_sentences = np.zeros([T, N_batch], dtype=np.float32) 72 | target_sentences = np.zeros([T, N_batch], dtype=np.float32) 73 | bbox_coordinates = np.zeros([N_batch, 8], dtype=np.float32) 74 | 
fc7_context = np.zeros([N_batch, 4096], dtype=np.float32) 75 | for n_pair in range(begin, end): 76 | # Append 0 as dummy label 77 | imcrop_path = resized_imcrop_dir + train_pairs[n_pair][0] + '.png 0' 78 | imcrop_list.append(imcrop_path) 79 | # Append 0 as dummy label 80 | wholeim_path = image_dir + train_pairs[n_pair][3] + '.jpg 0' 81 | wholeim_list.append(wholeim_path) 82 | stream = retriever.sentence2vocab_indices(train_pairs[n_pair][1], 83 | vocab_dict) 84 | if len(stream) > T-1: 85 | stream = stream[:T-1] 86 | pad = T - 1 - len(stream) 87 | cont_sentences[:, n_pair-begin] = [0] + [1] * len(stream) + [0] * pad 88 | input_sentences[:, n_pair-begin] = [0] + stream + [-1] * pad 89 | target_sentences[:, n_pair-begin] = stream + [0] + [-1] * pad 90 | bbox_coordinates[n_pair-begin, :] = np.squeeze(train_pairs[n_pair][2]) 91 | fc7_context[n_pair-begin, :] = train_pairs[n_pair][4] 92 | h5_text_filename = save_hdf5_dir + 'text_%d_to_%d.h5' % (begin, end) 93 | h5_bbox_filename = save_hdf5_dir + 'bbox_context_%d_to_%d.h5' % (begin, end) 94 | retriever.write_batch_to_hdf5(h5_text_filename, cont_sentences, 95 | input_sentences, target_sentences) 96 | retriever.write_bbox_context_to_hdf5(h5_bbox_filename, bbox_coordinates, 97 | fc7_context) 98 | hdf5_text_list.append(h5_text_filename) 99 | hdf5_bbox_list.append(h5_bbox_filename) 100 | 101 | util.io.save_str_list(imcrop_list, save_imcrop_list_file) 102 | util.io.save_str_list(wholeim_list, save_wholeim_list_file) 103 | util.io.save_str_list(hdf5_text_list, save_hdf5_text_list_file) 104 | util.io.save_str_list(hdf5_bbox_list, save_hdf5_bbox_list_file) 105 | -------------------------------------------------------------------------------- /exp-referit/caffemodel/.keep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ronghanghu/natural-language-object-retrieval/c6ddd5d78e9d4d886abc20d4e1b4421b3795a89e/exp-referit/caffemodel/.keep -------------------------------------------------------------------------------- /exp-referit/initialize_weights_scrc_full.py: -------------------------------------------------------------------------------- 1 | from __future__ import division, print_function 2 | 3 | import sys 4 | import numpy as np 5 | sys.path.append('./external/caffe-natural-language-object-retrieval/python/') 6 | import caffe 7 | 8 | old_prototxt = './prototxt/coco_pretrained.prototxt' 9 | old_caffemodel = './models/coco_pretrained_iter_100000.caffemodel' 10 | new_prototxt = './prototxt/scrc_full_vgg_buffer_50.prototxt' 11 | new_caffemodel = './exp-referit/caffemodel/scrc_full_vgg_init.caffemodel' 12 | old_net = caffe.Net(old_prototxt, old_caffemodel, caffe.TRAIN) 13 | new_net = caffe.Net(new_prototxt, old_caffemodel, caffe.TRAIN) 14 | 15 | new_net.params['fc8_context'][0].data[...] = old_net.params['fc8'][0].data[...] 16 | new_net.params['fc8_context'][1].data[...] = old_net.params['fc8'][1].data[...] 17 | 18 | new_net.params['lstm2-extended'][0].data[...] = old_net.params['lstm2'][0].data[...] 19 | new_net.params['lstm2-extended'][1].data[...] = old_net.params['lstm2'][1].data[...] 20 | new_net.params['lstm2-extended'][2].data[:, :1000] = old_net.params['lstm2'][2].data[...] 21 | new_net.params['lstm2-extended'][2].data[:, 1000:] = 0 22 | new_net.params['lstm2-extended'][3].data[...] = old_net.params['lstm2'][3].data[...] 23 | 24 | new_net.params['lstm2_context'][0].data[...] = old_net.params['lstm2'][0].data[...] 25 | new_net.params['lstm2_context'][1].data[...] 
= old_net.params['lstm2'][1].data[...] 26 | new_net.params['lstm2_context'][2].data[...] = old_net.params['lstm2'][2].data[...] 27 | new_net.params['lstm2_context'][3].data[...] = old_net.params['lstm2'][3].data[...] 28 | 29 | new_net.save(new_caffemodel) 30 | -------------------------------------------------------------------------------- /exp-referit/initialize_weights_scrc_no_context.py: -------------------------------------------------------------------------------- 1 | from __future__ import division, print_function 2 | 3 | import sys 4 | import numpy as np 5 | sys.path.append('./external/caffe-natural-language-object-retrieval/python/') 6 | import caffe 7 | 8 | old_prototxt = './prototxt/coco_pretrained.prototxt' 9 | old_caffemodel = './models/coco_pretrained_iter_100000.caffemodel' 10 | new_prototxt = './prototxt/scrc_no_context_vgg_buffer_50.prototxt' 11 | new_caffemodel = './exp-referit/caffemodel/scrc_no_context_vgg_init.caffemodel' 12 | old_net = caffe.Net(old_prototxt, old_caffemodel, caffe.TRAIN) 13 | new_net = caffe.Net(new_prototxt, old_caffemodel, caffe.TRAIN) 14 | 15 | new_net.params['lstm2-extended'][0].data[...] = old_net.params['lstm2'][0].data[...] 16 | new_net.params['lstm2-extended'][1].data[...] = old_net.params['lstm2'][1].data[...] 17 | new_net.params['lstm2-extended'][2].data[:, :1000] = old_net.params['lstm2'][2].data[...] 18 | new_net.params['lstm2-extended'][2].data[:, 1000:] = 0 19 | new_net.params['lstm2-extended'][3].data[...] = old_net.params['lstm2'][3].data[...] 20 | 21 | new_net.save(new_caffemodel) 22 | -------------------------------------------------------------------------------- /exp-referit/preprocess_dataset.py: -------------------------------------------------------------------------------- 1 | from __future__ import division, print_function 2 | 3 | import os 4 | import numpy as np 5 | import scipy.io as sio 6 | import skimage 7 | import skimage.io 8 | import skimage.transform 9 | 10 | import util 11 | 12 | 13 | def load_imcrop(imlist, mask_dir): 14 | imcrop_dict = {im_name: [] for im_name in imlist} 15 | imcroplist = [] 16 | masklist = os.listdir(mask_dir) 17 | for mask_name in masklist: 18 | imcrop_name = mask_name.split('.', 1)[0] 19 | imcroplist.append(imcrop_name) 20 | im_name = imcrop_name.split('_', 1)[0] 21 | imcrop_dict[im_name].append(imcrop_name) 22 | return imcroplist, imcrop_dict 23 | 24 | 25 | def load_image_size(imlist, image_dir): 26 | num_im = len(imlist) 27 | imsize_dict = {} 28 | for n_im in range(num_im): 29 | if n_im % 200 == 0: 30 | print('processing image %d / %d' % (n_im, num_im)) 31 | im = skimage.io.imread(image_dir + imlist[n_im] + '.jpg') 32 | imsize_dict[imlist[n_im]] = [im.shape[1], im.shape[0]] # [width, height] 33 | return imsize_dict 34 | 35 | 36 | def load_referit_annotation(imcroplist, annotation_file): 37 | print('loading ReferIt dataset annotations...') 38 | query_dict = {imcrop_name: [] for imcrop_name in imcroplist} 39 | with open(annotation_file) as f: 40 | raw_annotation = f.readlines() 41 | for s in raw_annotation: 42 | # example annotation line: 43 | # 8756_2.jpg~sunray at very top~.33919597989949750~.023411371237458192 44 | splits = s.strip().split('~', 2) 45 | # example: 8756_2 (segmentation regions) 46 | imcrop_name = splits[0].split('.', 1)[0] 47 | # example: 'sunray at very top' 48 | description = splits[1] 49 | # construct imcrop_name - description list dictionary 50 | # an image crop can have zero or multiple annotations 51 | query_dict[imcrop_name].append(description) 52 | return query_dict 53 |
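# Illustration (values taken from the example line quoted above):
#   s.strip().split('~', 2)  ->  ['8756_2.jpg', 'sunray at very top',
#                                 '.33919597989949750~.023411371237458192']
# splits[2] is discarded; imcrop_name becomes '8756_2', and the description
# 'sunray at very top' is appended to query_dict['8756_2'].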
54 | 55 | def load_and_resize_imcrop(mask_dir, image_dir, resized_imcrop_dir): 56 | print('loading image crop bounding boxes...') 57 | imcrop_bbox_dict = {} 58 | masklist = os.listdir(mask_dir) 59 | if not os.path.isdir(resized_imcrop_dir): 60 | os.mkdir(resized_imcrop_dir) 61 | for n in range(len(masklist)): 62 | if n % 200 == 0: 63 | print('processing image crop %d / %d' % (n, len(masklist))) 64 | mask_name = masklist[n] 65 | mask = sio.loadmat(mask_dir + mask_name)['segimg_t'] 66 | idx = np.nonzero(mask == 0) 67 | x_min, x_max = np.min(idx[1]), np.max(idx[1]) 68 | y_min, y_max = np.min(idx[0]), np.max(idx[0]) 69 | bbox = [x_min, y_min, x_max, y_max] 70 | imcrop_name = mask_name.split('.', 1)[0] 71 | imcrop_bbox_dict[imcrop_name] = bbox 72 | 73 | # resize the image crops 74 | imname = imcrop_name.split('_', 1)[0] + '.jpg' 75 | image_path = image_dir + imname 76 | im = skimage.io.imread(image_path) 77 | # Gray scale to RGB 78 | if im.ndim == 2: 79 | im = np.tile(im[..., np.newaxis], (1, 1, 3)) 80 | # RGBA to RGB 81 | im = im[:, :, :3] 82 | resized_im = skimage.transform.resize(im[y_min:y_max+1, 83 | x_min:x_max+1, :], [224, 224]) 84 | save_path = resized_imcrop_dir + imcrop_name + '.png' 85 | skimage.io.imsave(save_path, resized_im) 86 | return imcrop_bbox_dict 87 | 88 | 89 | def main(): 90 | image_dir = './datasets/ReferIt/ImageCLEF/images/' 91 | mask_dir = './datasets/ReferIt/ImageCLEF/mask/' 92 | annotation_file = './datasets/ReferIt/ReferitData/RealGames.txt' 93 | imlist_file = './data/split/referit_all_imlist.txt' 94 | metadata_dir = './data/metadata/' 95 | resized_imcrop_dir = './data/resized_imcrop/' 96 | 97 | imlist = util.io.load_str_list(imlist_file) 98 | imsize_dict = load_image_size(imlist, image_dir) 99 | imcroplist, imcrop_dict = load_imcrop(imlist, mask_dir) 100 | query_dict = load_referit_annotation(imcroplist, annotation_file) 101 | imcrop_bbox_dict = load_and_resize_imcrop(mask_dir, image_dir, 102 | resized_imcrop_dir) 103 | 104 | util.io.save_json(imsize_dict, metadata_dir + 'referit_imsize_dict.json') 105 | util.io.save_json(imcrop_dict, metadata_dir + 'referit_imcrop_dict.json') 106 | util.io.save_json(query_dict, metadata_dir + 'referit_query_dict.json') 107 | util.io.save_json(imcrop_bbox_dict, metadata_dir + 'referit_imcrop_bbox_dict.json') 108 | 109 | if __name__ == '__main__': 110 | main() 111 | -------------------------------------------------------------------------------- /exp-referit/test_scrc_on_referit.py: -------------------------------------------------------------------------------- 1 | from __future__ import division, print_function 2 | 3 | import sys 4 | import numpy as np 5 | import skimage.io 6 | sys.path.append('./external/caffe-natural-language-object-retrieval/python/') 7 | sys.path.append('./external/caffe-natural-language-object-retrieval/examples/coco_caption/') 8 | import caffe 9 | 10 | import util 11 | from captioner import Captioner 12 | import retriever 13 | 14 | ################################################################################ 15 | # Test Parameters 16 | 17 | # Test on either all annotated regions, or top-100 EdgeBox proposals 18 | # See Section 4.1 in the paper for details 19 | candidate_regions = 'proposal_regions' 20 | # candidate_regions = 'annotated_regions' 21 | 22 | # Whether or not scene-level context are used in predictions 23 | use_context = True 24 | 25 | if use_context: 26 | lstm_net_proto = './prototxt/scrc_word_to_preds_full.prototxt' 27 | pretrained_weights_path = './models/scrc_full_vgg.caffemodel' 28 | else: 
29 | lstm_net_proto = './prototxt/scrc_word_to_preds_no_context.prototxt' 30 | pretrained_weights_path = './models/scrc_no_context_vgg.caffemodel' 31 | 32 | gpu_id = 0 # the GPU to test the SCRC model 33 | correct_IoU_threshold = 0.5 34 | 35 | tst_imlist_file = './data/split/referit_test_imlist.txt' 36 | ################################################################################ 37 | 38 | image_dir = './datasets/ReferIt/ImageCLEF/images/' 39 | proposal_dir = './data/referit_edgeboxes_top100/' 40 | cached_context_features_dir = './data/referit_context_features/' 41 | 42 | imcrop_dict_file = './data/metadata/referit_imcrop_dict.json' 43 | imcrop_bbox_dict_file = './data/metadata/referit_imcrop_bbox_dict.json' 44 | query_file = './data/metadata/referit_query_dict.json' 45 | vocab_file = './data/vocabulary.txt' 46 | 47 | # utilize the captioner module from LRCN 48 | image_net_proto = './prototxt/VGG_ILSVRC_16_layers_deploy.prototxt' 49 | captioner = Captioner(pretrained_weights_path, image_net_proto, lstm_net_proto, 50 | vocab_file, gpu_id) 51 | captioner.set_image_batch_size(50) 52 | vocab_dict = retriever.build_vocab_dict_from_captioner(captioner) 53 | 54 | # Load image and caption list 55 | imlist = util.io.load_str_list(tst_imlist_file) 56 | num_im = len(imlist) 57 | query_dict = util.io.load_json(query_file) 58 | imcrop_dict = util.io.load_json(imcrop_dict_file) 59 | imcrop_bbox_dict = util.io.load_json(imcrop_bbox_dict_file) 60 | 61 | # Load candidate regions (bounding boxes) 62 | load_proposal = (candidate_regions == 'proposal_regions') 63 | candidate_boxes_dict = {imname: None for imname in imlist} 64 | for n_im in range(num_im): 65 | if n_im % 1000 == 0: 66 | print('loading candidate regions %d / %d' % (n_im, num_im)) 67 | imname = imlist[n_im] 68 | if load_proposal: 69 | proposal_file_name = imname + '.txt' 70 | boxes = np.loadtxt(proposal_dir + proposal_file_name) 71 | boxes = boxes.astype(int).reshape((-1, 4)) 72 | else: 73 | boxes = [imcrop_bbox_dict[imcrop_name] 74 | for imcrop_name in imcrop_dict[imname]] 75 | boxes = np.array(boxes).astype(int).reshape((-1, 4)) 76 | candidate_boxes_dict[imname] = boxes 77 | 78 | 79 | # Load cached whole-image contextual features 80 | if use_context: 81 | context_features_dict = {imname: None for imname in imlist} 82 | for n_im in range(num_im): 83 | if n_im % 1000 == 0: 84 | print('loading contextual features %d / %d' % (n_im, num_im)) 85 | imname = imlist[n_im] 86 | cached_context_features_file = cached_context_features_dir + imname + '_fc7.npy' 87 | context_features_dict[imname] = np.load(cached_context_features_file).reshape((1, 4096)) 88 | 89 | ################################################################################ 90 | # Test recall 91 | K = 100 # evaluate recall at 1, 2, ..., K 92 | topK_correct_num = np.zeros(K, dtype=np.float32) 93 | total_num = 0 94 | for n_im in range(num_im): 95 | print('testing image %d / %d' % (n_im, num_im)) 96 | imname = imlist[n_im] 97 | imcrop_names = imcrop_dict[imname] 98 | candidate_boxes = candidate_boxes_dict[imname] 99 | 100 | im = skimage.io.imread(image_dir + imname + '.jpg') 101 | imsize = np.array([im.shape[1], im.shape[0]]) # [width, height] 102 | 103 | # Compute local descriptors (local image feature + spatial feature) 104 | descriptors = retriever.compute_descriptors_edgebox(captioner, im, 105 | candidate_boxes) 106 | spatial_feats = retriever.compute_spatial_feat(candidate_boxes, imsize) 107 | descriptors = np.concatenate((descriptors, spatial_feats), axis=1) 108 | 109 | num_imcrop = 
len(imcrop_names) 110 | num_proposal = candidate_boxes.shape[0] 111 | for n_imcrop in range(num_imcrop): 112 | imcrop_name = imcrop_names[n_imcrop] 113 | if imcrop_name not in query_dict: 114 | continue 115 | gt_bbox = np.array(imcrop_bbox_dict[imcrop_name]) 116 | IoUs = retriever.compute_iou(candidate_boxes, gt_bbox) 117 | for n_sentence in range(len(query_dict[imcrop_name])): 118 | sentence = query_dict[imcrop_name][n_sentence] 119 | # Scores for each candidate region 120 | if use_context: 121 | scores = retriever.score_descriptors_context(descriptors, sentence, 122 | context_features_dict[imname], captioner, vocab_dict) 123 | else: 124 | scores = retriever.score_descriptors(descriptors, sentence, 125 | captioner, vocab_dict) 126 | 127 | # Evaluate the correctness of top K predictions 128 | topK_ids = np.argsort(-scores)[:K] 129 | topK_IoUs = IoUs[topK_ids] 130 | # whether the K-th (ranking from high to low) candidate is correct 131 | topK_is_correct = np.zeros(K, dtype=bool) 132 | topK_is_correct[:len(topK_ids)] = (topK_IoUs >= correct_IoU_threshold) 133 | # whether at least one of the top K candidates is correct 134 | topK_any_correct = (np.cumsum(topK_is_correct) > 0) 135 | topK_correct_num += topK_any_correct 136 | total_num += 1 137 | 138 | # print intermediate results during testing 139 | if (n_im+1) % 1000 == 0: 140 | print('Recall on first %d test images' % (n_im+1)) 141 | for k in [0, 10-1]: 142 | print('\trecall @ %d = %f' % (k+1, topK_correct_num[k]/total_num)) 143 | 144 | print('Final recall on the whole test set') 145 | for k in [0, 10-1]: 146 | print('\trecall @ %d = %f' % (k+1, topK_correct_num[k]/total_num)) 147 | ################################################################################ 148 | -------------------------------------------------------------------------------- /exp-referit/train_scrc_full_on_referit.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | GPU_ID=0 3 | WEIGHTS=./exp-referit/caffemodel/scrc_full_vgg_init.caffemodel 4 | 5 | caffe train \ 6 | -solver ./prototxt/scrc_full_vgg_solver.prototxt \ 7 | -weights $WEIGHTS \ 8 | -gpu $GPU_ID 2>&1 9 | -------------------------------------------------------------------------------- /exp-referit/train_scrc_no_context_on_referit.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | GPU_ID=0 3 | WEIGHTS=./exp-referit/caffemodel/scrc_no_context_vgg_init.caffemodel 4 | 5 | caffe train \ 6 | -solver ./prototxt/scrc_no_context_vgg_solver.prototxt \ 7 | -weights $WEIGHTS \ 8 | -gpu $GPU_ID 2>&1 9 | -------------------------------------------------------------------------------- /external/download_caffe.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | wget -O ./external/caffe-natural-language-object-retrieval.zip http://www.eecs.berkeley.edu/~ronghang/projects/cvpr16_text_obj_retrieval/caffe-natural-language-object-retrieval.zip 3 | unzip ./external/caffe-natural-language-object-retrieval.zip -d ./external 4 | -------------------------------------------------------------------------------- /models/download_trained_models.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | wget -O ./models/VGG_ILSVRC_16_layers.caffemodel http://www.eecs.berkeley.edu/~ronghang/projects/cvpr16_text_obj_retrieval/models/VGG_ILSVRC_16_layers.caffemodel 3 | wget -O ./models/coco_pretrained_iter_100000.caffemodel 
http://www.eecs.berkeley.edu/~ronghang/projects/cvpr16_text_obj_retrieval/models/coco_pretrained_iter_100000.caffemodel 4 | wget -O ./models/scrc_full_vgg.caffemodel http://www.eecs.berkeley.edu/~ronghang/projects/cvpr16_text_obj_retrieval/models/scrc_full_vgg.caffemodel 5 | wget -O ./models/scrc_no_context_vgg.caffemodel http://www.eecs.berkeley.edu/~ronghang/projects/cvpr16_text_obj_retrieval/models/scrc_no_context_vgg.caffemodel 6 | 7 | wget -O ./models/scrc_kitchen.caffemodel http://www.eecs.berkeley.edu/~ronghang/projects/cvpr16_text_obj_retrieval/models/scrc_kitchen.caffemodel 8 | -------------------------------------------------------------------------------- /prototxt/VGG_ILSVRC_16_layers_deploy.prototxt: -------------------------------------------------------------------------------- 1 | name: "VGG_ILSVRC_16_layers" 2 | input: "data" 3 | input_dim: 10 4 | input_dim: 3 5 | input_dim: 224 6 | input_dim: 224 7 | layer { 8 | name: "conv1_1" 9 | type: "Convolution" 10 | bottom: "data" 11 | top: "conv1_1" 12 | convolution_param { 13 | num_output: 64 14 | pad: 1 15 | kernel_size: 3 16 | } 17 | } 18 | layer { 19 | name: "relu1_1" 20 | type: "ReLU" 21 | bottom: "conv1_1" 22 | top: "conv1_1" 23 | } 24 | layer { 25 | name: "conv1_2" 26 | type: "Convolution" 27 | bottom: "conv1_1" 28 | top: "conv1_2" 29 | convolution_param { 30 | num_output: 64 31 | pad: 1 32 | kernel_size: 3 33 | } 34 | } 35 | layer { 36 | name: "relu1_2" 37 | type: "ReLU" 38 | bottom: "conv1_2" 39 | top: "conv1_2" 40 | } 41 | layer { 42 | name: "pool1" 43 | type: "Pooling" 44 | bottom: "conv1_2" 45 | top: "pool1" 46 | pooling_param { 47 | pool: MAX 48 | kernel_size: 2 49 | stride: 2 50 | } 51 | } 52 | layer { 53 | name: "conv2_1" 54 | type: "Convolution" 55 | bottom: "pool1" 56 | top: "conv2_1" 57 | convolution_param { 58 | num_output: 128 59 | pad: 1 60 | kernel_size: 3 61 | } 62 | } 63 | layer { 64 | name: "relu2_1" 65 | type: "ReLU" 66 | bottom: "conv2_1" 67 | top: "conv2_1" 68 | } 69 | layer { 70 | name: "conv2_2" 71 | type: "Convolution" 72 | bottom: "conv2_1" 73 | top: "conv2_2" 74 | convolution_param { 75 | num_output: 128 76 | pad: 1 77 | kernel_size: 3 78 | } 79 | } 80 | layer { 81 | name: "relu2_2" 82 | type: "ReLU" 83 | bottom: "conv2_2" 84 | top: "conv2_2" 85 | } 86 | layer { 87 | name: "pool2" 88 | type: "Pooling" 89 | bottom: "conv2_2" 90 | top: "pool2" 91 | pooling_param { 92 | pool: MAX 93 | kernel_size: 2 94 | stride: 2 95 | } 96 | } 97 | layer { 98 | name: "conv3_1" 99 | type: "Convolution" 100 | bottom: "pool2" 101 | top: "conv3_1" 102 | convolution_param { 103 | num_output: 256 104 | pad: 1 105 | kernel_size: 3 106 | } 107 | } 108 | layer { 109 | name: "relu3_1" 110 | type: "ReLU" 111 | bottom: "conv3_1" 112 | top: "conv3_1" 113 | } 114 | layer { 115 | name: "conv3_2" 116 | type: "Convolution" 117 | bottom: "conv3_1" 118 | top: "conv3_2" 119 | convolution_param { 120 | num_output: 256 121 | pad: 1 122 | kernel_size: 3 123 | } 124 | } 125 | layer { 126 | name: "relu3_2" 127 | type: "ReLU" 128 | bottom: "conv3_2" 129 | top: "conv3_2" 130 | } 131 | layer { 132 | name: "conv3_3" 133 | type: "Convolution" 134 | bottom: "conv3_2" 135 | top: "conv3_3" 136 | convolution_param { 137 | num_output: 256 138 | pad: 1 139 | kernel_size: 3 140 | } 141 | } 142 | layer { 143 | name: "relu3_3" 144 | type: "ReLU" 145 | bottom: "conv3_3" 146 | top: "conv3_3" 147 | } 148 | layer { 149 | name: "pool3" 150 | type: "Pooling" 151 | bottom: "conv3_3" 152 | top: "pool3" 153 | pooling_param { 154 | pool: MAX 155 | kernel_size: 2 
156 | stride: 2 157 | } 158 | } 159 | layer { 160 | name: "conv4_1" 161 | type: "Convolution" 162 | bottom: "pool3" 163 | top: "conv4_1" 164 | convolution_param { 165 | num_output: 512 166 | pad: 1 167 | kernel_size: 3 168 | } 169 | } 170 | layer { 171 | name: "relu4_1" 172 | type: "ReLU" 173 | bottom: "conv4_1" 174 | top: "conv4_1" 175 | } 176 | layer { 177 | name: "conv4_2" 178 | type: "Convolution" 179 | bottom: "conv4_1" 180 | top: "conv4_2" 181 | convolution_param { 182 | num_output: 512 183 | pad: 1 184 | kernel_size: 3 185 | } 186 | } 187 | layer { 188 | name: "relu4_2" 189 | type: "ReLU" 190 | bottom: "conv4_2" 191 | top: "conv4_2" 192 | } 193 | layer { 194 | name: "conv4_3" 195 | type: "Convolution" 196 | bottom: "conv4_2" 197 | top: "conv4_3" 198 | convolution_param { 199 | num_output: 512 200 | pad: 1 201 | kernel_size: 3 202 | } 203 | } 204 | layer { 205 | name: "relu4_3" 206 | type: "ReLU" 207 | bottom: "conv4_3" 208 | top: "conv4_3" 209 | } 210 | layer { 211 | name: "pool4" 212 | type: "Pooling" 213 | bottom: "conv4_3" 214 | top: "pool4" 215 | pooling_param { 216 | pool: MAX 217 | kernel_size: 2 218 | stride: 2 219 | } 220 | } 221 | layer { 222 | name: "conv5_1" 223 | type: "Convolution" 224 | bottom: "pool4" 225 | top: "conv5_1" 226 | convolution_param { 227 | num_output: 512 228 | pad: 1 229 | kernel_size: 3 230 | } 231 | } 232 | layer { 233 | name: "relu5_1" 234 | type: "ReLU" 235 | bottom: "conv5_1" 236 | top: "conv5_1" 237 | } 238 | layer { 239 | name: "conv5_2" 240 | type: "Convolution" 241 | bottom: "conv5_1" 242 | top: "conv5_2" 243 | convolution_param { 244 | num_output: 512 245 | pad: 1 246 | kernel_size: 3 247 | } 248 | } 249 | layer { 250 | name: "relu5_2" 251 | type: "ReLU" 252 | bottom: "conv5_2" 253 | top: "conv5_2" 254 | } 255 | layer { 256 | name: "conv5_3" 257 | type: "Convolution" 258 | bottom: "conv5_2" 259 | top: "conv5_3" 260 | convolution_param { 261 | num_output: 512 262 | pad: 1 263 | kernel_size: 3 264 | } 265 | } 266 | layer { 267 | name: "relu5_3" 268 | type: "ReLU" 269 | bottom: "conv5_3" 270 | top: "conv5_3" 271 | } 272 | layer { 273 | name: "pool5" 274 | type: "Pooling" 275 | bottom: "conv5_3" 276 | top: "pool5" 277 | pooling_param { 278 | pool: MAX 279 | kernel_size: 2 280 | stride: 2 281 | } 282 | } 283 | layer { 284 | name: "fc6" 285 | type: "InnerProduct" 286 | bottom: "pool5" 287 | top: "fc6" 288 | inner_product_param { 289 | num_output: 4096 290 | } 291 | } 292 | layer { 293 | name: "relu6" 294 | type: "ReLU" 295 | bottom: "fc6" 296 | top: "fc6" 297 | } 298 | layer { 299 | name: "drop6" 300 | type: "Dropout" 301 | bottom: "fc6" 302 | top: "fc6" 303 | dropout_param { 304 | dropout_ratio: 0.5 305 | } 306 | } 307 | layer { 308 | name: "fc7" 309 | type: "InnerProduct" 310 | bottom: "fc6" 311 | top: "fc7" 312 | inner_product_param { 313 | num_output: 4096 314 | } 315 | } 316 | layer { 317 | name: "relu7" 318 | type: "ReLU" 319 | bottom: "fc7" 320 | top: "fc7" 321 | } 322 | layer { 323 | name: "drop7" 324 | type: "Dropout" 325 | bottom: "fc7" 326 | top: "fc7" 327 | dropout_param { 328 | dropout_ratio: 0.5 329 | } 330 | } 331 | layer { 332 | name: "fc8" 333 | type: "InnerProduct" 334 | bottom: "fc7" 335 | top: "fc8" 336 | inner_product_param { 337 | num_output: 1000 338 | } 339 | } 340 | layer { 341 | name: "prob" 342 | type: "Softmax" 343 | bottom: "fc8" 344 | top: "prob" 345 | } 346 | -------------------------------------------------------------------------------- /prototxt/coco_pretrained.prototxt: 
-------------------------------------------------------------------------------- 1 | state { 2 | phase: TRAIN level: 0 3 | stage: 'freeze-convnet' stage: 'factored' stage: '2-layer' 4 | } 5 | 6 | input: "data" 7 | input_shape { dim: 1, dim: 3, dim: 224, dim: 224 } 8 | input: "cont_sentence" 9 | input_shape { dim: 20, dim: 1 } 10 | input: "input_sentence" 11 | input_shape { dim: 20, dim: 1 } 12 | 13 | layer { 14 | name: "conv1_1" 15 | type: "Convolution" 16 | bottom: "data" 17 | top: "conv1_1" 18 | param { lr_mult: 0 } 19 | param { lr_mult: 0 decay_mult: 0 } 20 | include { stage: "freeze-convnet" } 21 | convolution_param { 22 | num_output: 64 23 | pad: 1 24 | kernel_size: 3 25 | } 26 | } 27 | layer { 28 | name: "conv1_1" 29 | type: "Convolution" 30 | bottom: "data" 31 | top: "conv1_1" 32 | param { lr_mult: 0.1 } 33 | param { lr_mult: 0.2 decay_mult: 0} 34 | exclude { stage: "freeze-convnet" } 35 | convolution_param { 36 | num_output: 64 37 | pad: 1 38 | kernel_size: 3 39 | } 40 | } 41 | layer { 42 | name: "relu1_1" 43 | type: "ReLU" 44 | bottom: "conv1_1" 45 | top: "conv1_1" 46 | } 47 | layer { 48 | name: "conv1_2" 49 | type: "Convolution" 50 | bottom: "conv1_1" 51 | top: "conv1_2" 52 | param { lr_mult: 0 } 53 | param { lr_mult: 0 decay_mult: 0 } 54 | include { stage: "freeze-convnet" } 55 | convolution_param { 56 | num_output: 64 57 | pad: 1 58 | kernel_size: 3 59 | } 60 | } 61 | layer { 62 | name: "conv1_2" 63 | type: "Convolution" 64 | bottom: "conv1_1" 65 | top: "conv1_2" 66 | param { lr_mult: 0.1 } 67 | param { lr_mult: 0.2 decay_mult: 0} 68 | exclude { stage: "freeze-convnet" } 69 | convolution_param { 70 | num_output: 64 71 | pad: 1 72 | kernel_size: 3 73 | } 74 | } 75 | layer { 76 | name: "relu1_2" 77 | type: "ReLU" 78 | bottom: "conv1_2" 79 | top: "conv1_2" 80 | } 81 | layer { 82 | name: "pool1" 83 | type: "Pooling" 84 | bottom: "conv1_2" 85 | top: "pool1" 86 | pooling_param { 87 | pool: MAX 88 | kernel_size: 2 89 | stride: 2 90 | } 91 | } 92 | layer { 93 | name: "conv2_1" 94 | type: "Convolution" 95 | bottom: "pool1" 96 | top: "conv2_1" 97 | param { lr_mult: 0 } 98 | param { lr_mult: 0 decay_mult: 0 } 99 | include { stage: "freeze-convnet" } 100 | convolution_param { 101 | num_output: 128 102 | pad: 1 103 | kernel_size: 3 104 | } 105 | } 106 | layer { 107 | name: "conv2_1" 108 | type: "Convolution" 109 | bottom: "pool1" 110 | top: "conv2_1" 111 | param { lr_mult: 0.1 } 112 | param { lr_mult: 0.2 decay_mult: 0} 113 | exclude { stage: "freeze-convnet" } 114 | convolution_param { 115 | num_output: 128 116 | pad: 1 117 | kernel_size: 3 118 | } 119 | } 120 | layer { 121 | name: "relu2_1" 122 | type: "ReLU" 123 | bottom: "conv2_1" 124 | top: "conv2_1" 125 | } 126 | layer { 127 | name: "conv2_2" 128 | type: "Convolution" 129 | bottom: "conv2_1" 130 | top: "conv2_2" 131 | param { lr_mult: 0 } 132 | param { lr_mult: 0 decay_mult: 0 } 133 | include { stage: "freeze-convnet" } 134 | convolution_param { 135 | num_output: 128 136 | pad: 1 137 | kernel_size: 3 138 | } 139 | } 140 | layer { 141 | name: "conv2_2" 142 | type: "Convolution" 143 | bottom: "conv2_1" 144 | top: "conv2_2" 145 | param { lr_mult: 0.1 } 146 | param { lr_mult: 0.2 decay_mult: 0} 147 | exclude { stage: "freeze-convnet" } 148 | convolution_param { 149 | num_output: 128 150 | pad: 1 151 | kernel_size: 3 152 | } 153 | } 154 | layer { 155 | name: "relu2_2" 156 | type: "ReLU" 157 | bottom: "conv2_2" 158 | top: "conv2_2" 159 | } 160 | layer { 161 | name: "pool2" 162 | type: "Pooling" 163 | bottom: "conv2_2" 164 | top: "pool2" 165 
| pooling_param { 166 | pool: MAX 167 | kernel_size: 2 168 | stride: 2 169 | } 170 | } 171 | layer { 172 | name: "conv3_1" 173 | type: "Convolution" 174 | bottom: "pool2" 175 | top: "conv3_1" 176 | param { lr_mult: 0 } 177 | param { lr_mult: 0 decay_mult: 0 } 178 | include { stage: "freeze-convnet" } 179 | convolution_param { 180 | num_output: 256 181 | pad: 1 182 | kernel_size: 3 183 | } 184 | } 185 | layer { 186 | name: "conv3_1" 187 | type: "Convolution" 188 | bottom: "pool2" 189 | top: "conv3_1" 190 | param { lr_mult: 0.1 } 191 | param { lr_mult: 0.2 decay_mult: 0} 192 | exclude { stage: "freeze-convnet" } 193 | convolution_param { 194 | num_output: 256 195 | pad: 1 196 | kernel_size: 3 197 | } 198 | } 199 | layer { 200 | name: "relu3_1" 201 | type: "ReLU" 202 | bottom: "conv3_1" 203 | top: "conv3_1" 204 | } 205 | layer { 206 | name: "conv3_2" 207 | type: "Convolution" 208 | bottom: "conv3_1" 209 | top: "conv3_2" 210 | param { lr_mult: 0 } 211 | param { lr_mult: 0 decay_mult: 0 } 212 | include { stage: "freeze-convnet" } 213 | convolution_param { 214 | num_output: 256 215 | pad: 1 216 | kernel_size: 3 217 | } 218 | } 219 | layer { 220 | name: "conv3_2" 221 | type: "Convolution" 222 | bottom: "conv3_1" 223 | top: "conv3_2" 224 | param { lr_mult: 0.1 } 225 | param { lr_mult: 0.2 decay_mult: 0} 226 | exclude { stage: "freeze-convnet" } 227 | convolution_param { 228 | num_output: 256 229 | pad: 1 230 | kernel_size: 3 231 | } 232 | } 233 | layer { 234 | name: "relu3_2" 235 | type: "ReLU" 236 | bottom: "conv3_2" 237 | top: "conv3_2" 238 | } 239 | layer { 240 | name: "conv3_3" 241 | type: "Convolution" 242 | bottom: "conv3_2" 243 | top: "conv3_3" 244 | param { lr_mult: 0 } 245 | param { lr_mult: 0 decay_mult: 0 } 246 | include { stage: "freeze-convnet" } 247 | convolution_param { 248 | num_output: 256 249 | pad: 1 250 | kernel_size: 3 251 | } 252 | } 253 | layer { 254 | name: "conv3_3" 255 | type: "Convolution" 256 | bottom: "conv3_2" 257 | top: "conv3_3" 258 | param { lr_mult: 0.1 } 259 | param { lr_mult: 0.2 decay_mult: 0} 260 | exclude { stage: "freeze-convnet" } 261 | convolution_param { 262 | num_output: 256 263 | pad: 1 264 | kernel_size: 3 265 | } 266 | } 267 | layer { 268 | name: "relu3_3" 269 | type: "ReLU" 270 | bottom: "conv3_3" 271 | top: "conv3_3" 272 | } 273 | layer { 274 | name: "pool3" 275 | type: "Pooling" 276 | bottom: "conv3_3" 277 | top: "pool3" 278 | pooling_param { 279 | pool: MAX 280 | kernel_size: 2 281 | stride: 2 282 | } 283 | } 284 | layer { 285 | name: "conv4_1" 286 | type: "Convolution" 287 | bottom: "pool3" 288 | top: "conv4_1" 289 | param { lr_mult: 0 } 290 | param { lr_mult: 0 decay_mult: 0 } 291 | include { stage: "freeze-convnet" } 292 | convolution_param { 293 | num_output: 512 294 | pad: 1 295 | kernel_size: 3 296 | } 297 | } 298 | layer { 299 | name: "conv4_1" 300 | type: "Convolution" 301 | bottom: "pool3" 302 | top: "conv4_1" 303 | param { lr_mult: 0.1 } 304 | param { lr_mult: 0.2 decay_mult: 0} 305 | exclude { stage: "freeze-convnet" } 306 | convolution_param { 307 | num_output: 512 308 | pad: 1 309 | kernel_size: 3 310 | } 311 | } 312 | layer { 313 | name: "relu4_1" 314 | type: "ReLU" 315 | bottom: "conv4_1" 316 | top: "conv4_1" 317 | } 318 | layer { 319 | name: "conv4_2" 320 | type: "Convolution" 321 | bottom: "conv4_1" 322 | top: "conv4_2" 323 | param { lr_mult: 0 } 324 | param { lr_mult: 0 decay_mult: 0 } 325 | include { stage: "freeze-convnet" } 326 | convolution_param { 327 | num_output: 512 328 | pad: 1 329 | kernel_size: 3 330 | } 331 | } 332 | 
layer { 333 | name: "conv4_2" 334 | type: "Convolution" 335 | bottom: "conv4_1" 336 | top: "conv4_2" 337 | param { lr_mult: 0.1 } 338 | param { lr_mult: 0.2 decay_mult: 0} 339 | exclude { stage: "freeze-convnet" } 340 | convolution_param { 341 | num_output: 512 342 | pad: 1 343 | kernel_size: 3 344 | } 345 | } 346 | layer { 347 | name: "relu4_2" 348 | type: "ReLU" 349 | bottom: "conv4_2" 350 | top: "conv4_2" 351 | } 352 | layer { 353 | name: "conv4_3" 354 | type: "Convolution" 355 | bottom: "conv4_2" 356 | top: "conv4_3" 357 | param { lr_mult: 0 } 358 | param { lr_mult: 0 decay_mult: 0 } 359 | include { stage: "freeze-convnet" } 360 | convolution_param { 361 | num_output: 512 362 | pad: 1 363 | kernel_size: 3 364 | } 365 | } 366 | layer { 367 | name: "conv4_3" 368 | type: "Convolution" 369 | bottom: "conv4_2" 370 | top: "conv4_3" 371 | param { lr_mult: 0.1 } 372 | param { lr_mult: 0.2 decay_mult: 0} 373 | exclude { stage: "freeze-convnet" } 374 | convolution_param { 375 | num_output: 512 376 | pad: 1 377 | kernel_size: 3 378 | } 379 | } 380 | layer { 381 | name: "relu4_3" 382 | type: "ReLU" 383 | bottom: "conv4_3" 384 | top: "conv4_3" 385 | } 386 | layer { 387 | name: "pool4" 388 | type: "Pooling" 389 | bottom: "conv4_3" 390 | top: "pool4" 391 | pooling_param { 392 | pool: MAX 393 | kernel_size: 2 394 | stride: 2 395 | } 396 | } 397 | layer { 398 | name: "conv5_1" 399 | type: "Convolution" 400 | bottom: "pool4" 401 | top: "conv5_1" 402 | param { lr_mult: 0 } 403 | param { lr_mult: 0 decay_mult: 0 } 404 | include { stage: "freeze-convnet" } 405 | convolution_param { 406 | num_output: 512 407 | pad: 1 408 | kernel_size: 3 409 | } 410 | } 411 | layer { 412 | name: "conv5_1" 413 | type: "Convolution" 414 | bottom: "pool4" 415 | top: "conv5_1" 416 | param { lr_mult: 0.1 } 417 | param { lr_mult: 0.2 decay_mult: 0} 418 | exclude { stage: "freeze-convnet" } 419 | convolution_param { 420 | num_output: 512 421 | pad: 1 422 | kernel_size: 3 423 | } 424 | } 425 | layer { 426 | name: "relu5_1" 427 | type: "ReLU" 428 | bottom: "conv5_1" 429 | top: "conv5_1" 430 | } 431 | layer { 432 | name: "conv5_2" 433 | type: "Convolution" 434 | bottom: "conv5_1" 435 | top: "conv5_2" 436 | param { lr_mult: 0 } 437 | param { lr_mult: 0 decay_mult: 0 } 438 | include { stage: "freeze-convnet" } 439 | convolution_param { 440 | num_output: 512 441 | pad: 1 442 | kernel_size: 3 443 | } 444 | } 445 | layer { 446 | name: "conv5_2" 447 | type: "Convolution" 448 | bottom: "conv5_1" 449 | top: "conv5_2" 450 | param { lr_mult: 0.1 } 451 | param { lr_mult: 0.2 decay_mult: 0} 452 | exclude { stage: "freeze-convnet" } 453 | convolution_param { 454 | num_output: 512 455 | pad: 1 456 | kernel_size: 3 457 | } 458 | } 459 | layer { 460 | name: "relu5_2" 461 | type: "ReLU" 462 | bottom: "conv5_2" 463 | top: "conv5_2" 464 | } 465 | layer { 466 | name: "conv5_3" 467 | type: "Convolution" 468 | bottom: "conv5_2" 469 | top: "conv5_3" 470 | param { lr_mult: 0 } 471 | param { lr_mult: 0 decay_mult: 0 } 472 | include { stage: "freeze-convnet" } 473 | convolution_param { 474 | num_output: 512 475 | pad: 1 476 | kernel_size: 3 477 | } 478 | } 479 | layer { 480 | name: "conv5_3" 481 | type: "Convolution" 482 | bottom: "conv5_2" 483 | top: "conv5_3" 484 | param { lr_mult: 0.1 } 485 | param { lr_mult: 0.2 decay_mult: 0} 486 | exclude { stage: "freeze-convnet" } 487 | convolution_param { 488 | num_output: 512 489 | pad: 1 490 | kernel_size: 3 491 | } 492 | } 493 | layer { 494 | name: "relu5_3" 495 | type: "ReLU" 496 | bottom: "conv5_3" 497 | top: 
"conv5_3" 498 | } 499 | layer { 500 | name: "pool5" 501 | type: "Pooling" 502 | bottom: "conv5_3" 503 | top: "pool5" 504 | pooling_param { 505 | pool: MAX 506 | kernel_size: 2 507 | stride: 2 508 | } 509 | } 510 | layer { 511 | name: "fc6" 512 | type: "InnerProduct" 513 | bottom: "pool5" 514 | top: "fc6" 515 | param { lr_mult: 0 } 516 | param { lr_mult: 0 decay_mult: 0 } 517 | include { stage: "freeze-convnet" } 518 | inner_product_param { 519 | num_output: 4096 520 | } 521 | } 522 | layer { 523 | name: "fc6" 524 | type: "InnerProduct" 525 | bottom: "pool5" 526 | top: "fc6" 527 | param { lr_mult: 0.1 } 528 | param { lr_mult: 0.2 decay_mult: 0} 529 | exclude { stage: "freeze-convnet" } 530 | inner_product_param { 531 | num_output: 4096 532 | } 533 | } 534 | layer { 535 | name: "relu6" 536 | type: "ReLU" 537 | bottom: "fc6" 538 | top: "fc6" 539 | } 540 | layer { 541 | name: "drop6" 542 | type: "Dropout" 543 | bottom: "fc6" 544 | top: "fc6" 545 | dropout_param { 546 | dropout_ratio: 0.5 547 | } 548 | } 549 | layer { 550 | name: "fc7" 551 | type: "InnerProduct" 552 | bottom: "fc6" 553 | top: "fc7" 554 | param { lr_mult: 0 } 555 | param { lr_mult: 0 decay_mult: 0 } 556 | include { stage: "freeze-convnet" } 557 | inner_product_param { 558 | num_output: 4096 559 | } 560 | } 561 | layer { 562 | name: "fc7" 563 | type: "InnerProduct" 564 | bottom: "fc6" 565 | top: "fc7" 566 | param { lr_mult: 0.1 } 567 | param { lr_mult: 0.2 decay_mult: 0} 568 | exclude { stage: "freeze-convnet" } 569 | inner_product_param { 570 | num_output: 4096 571 | } 572 | } 573 | layer { 574 | name: "relu7" 575 | type: "ReLU" 576 | bottom: "fc7" 577 | top: "fc7" 578 | } 579 | layer { 580 | name: "drop7" 581 | type: "Dropout" 582 | bottom: "fc7" 583 | top: "fc7" 584 | dropout_param { 585 | dropout_ratio: 0.5 586 | } 587 | } 588 | layer { 589 | name: "fc8" 590 | type: "InnerProduct" 591 | bottom: "fc7" 592 | top: "fc8" 593 | param { 594 | lr_mult: 0.1 595 | decay_mult: 1 596 | } 597 | param { 598 | lr_mult: 0.2 599 | decay_mult: 0 600 | } 601 | inner_product_param { 602 | num_output: 1000 603 | } 604 | } 605 | layer { 606 | name: "embedding" 607 | type: "Embed" 608 | bottom: "input_sentence" 609 | top: "embedded_input_sentence" 610 | param { 611 | lr_mult: 1 612 | } 613 | embed_param { 614 | bias_term: false 615 | input_dim: 8801 616 | num_output: 1000 617 | weight_filler { 618 | type: "uniform" 619 | min: -0.08 620 | max: 0.08 621 | } 622 | } 623 | } 624 | layer { 625 | name: "lstm1" 626 | type: "LSTM" 627 | bottom: "embedded_input_sentence" 628 | bottom: "cont_sentence" 629 | bottom: "fc8" 630 | top: "lstm1" 631 | include { stage: "unfactored" } 632 | recurrent_param { 633 | num_output: 1000 634 | weight_filler { 635 | type: "uniform" 636 | min: -0.08 637 | max: 0.08 638 | } 639 | bias_filler { 640 | type: "constant" 641 | value: 0 642 | } 643 | } 644 | } 645 | layer { 646 | name: "lstm2" 647 | type: "LSTM" 648 | bottom: "lstm1" 649 | bottom: "cont_sentence" 650 | top: "lstm2" 651 | include { 652 | stage: "unfactored" 653 | stage: "2-layer" 654 | } 655 | recurrent_param { 656 | num_output: 1000 657 | weight_filler { 658 | type: "uniform" 659 | min: -0.08 660 | max: 0.08 661 | } 662 | bias_filler { 663 | type: "constant" 664 | value: 0 665 | } 666 | } 667 | } 668 | layer { 669 | name: "lstm1" 670 | type: "LSTM" 671 | bottom: "embedded_input_sentence" 672 | bottom: "cont_sentence" 673 | top: "lstm1" 674 | include { stage: "factored" } 675 | recurrent_param { 676 | num_output: 1000 677 | weight_filler { 678 | type: "uniform" 679 
| min: -0.08 680 | max: 0.08 681 | } 682 | bias_filler { 683 | type: "constant" 684 | value: 0 685 | } 686 | } 687 | } 688 | layer { 689 | name: "lstm2" 690 | type: "LSTM" 691 | bottom: "lstm1" 692 | bottom: "cont_sentence" 693 | bottom: "fc8" 694 | top: "lstm2" 695 | include { stage: "factored" } 696 | recurrent_param { 697 | num_output: 1000 698 | weight_filler { 699 | type: "uniform" 700 | min: -0.08 701 | max: 0.08 702 | } 703 | bias_filler { 704 | type: "constant" 705 | value: 0 706 | } 707 | } 708 | } 709 | layer { 710 | name: "predict" 711 | type: "InnerProduct" 712 | bottom: "lstm1" 713 | top: "predict" 714 | param { 715 | lr_mult: 1 716 | decay_mult: 1 717 | } 718 | param { 719 | lr_mult: 2 720 | decay_mult: 0 721 | } 722 | exclude { stage: "2-layer" } 723 | inner_product_param { 724 | num_output: 8801 725 | weight_filler { 726 | type: "uniform" 727 | min: -0.08 728 | max: 0.08 729 | } 730 | bias_filler { 731 | type: "constant" 732 | value: 0 733 | } 734 | axis: 2 735 | } 736 | } 737 | layer { 738 | name: "predict" 739 | type: "InnerProduct" 740 | bottom: "lstm2" 741 | top: "predict" 742 | param { 743 | lr_mult: 1 744 | decay_mult: 1 745 | } 746 | param { 747 | lr_mult: 2 748 | decay_mult: 0 749 | } 750 | include { stage: "2-layer" } 751 | inner_product_param { 752 | num_output: 8801 753 | weight_filler { 754 | type: "uniform" 755 | min: -0.08 756 | max: 0.08 757 | } 758 | bias_filler { 759 | type: "constant" 760 | value: 0 761 | } 762 | axis: 2 763 | } 764 | } 765 | -------------------------------------------------------------------------------- /prototxt/scrc_full_vgg_buffer_50.prototxt: -------------------------------------------------------------------------------- 1 | state { 2 | phase: TRAIN level: 0 3 | stage: 'freeze-convnet' stage: 'factored' stage: '2-layer' 4 | } 5 | 6 | # train data layers 7 | layer { 8 | name: "data" 9 | type: "ImageData" 10 | top: "data" 11 | top: "label" 12 | transform_param { 13 | crop_size: 224 14 | mean_value: 104 15 | mean_value: 117 16 | mean_value: 123 17 | } 18 | image_data_param { 19 | source: "./data/training/train_bbox_context_imcrop_list.txt" 20 | batch_size: 50 21 | } 22 | } 23 | layer { 24 | name: "data" 25 | type: "HDF5Data" 26 | top: "cont_sentence" 27 | top: "input_sentence" 28 | top: "target_sentence" 29 | hdf5_data_param { 30 | source: "./data/training/train_bbox_context_hdf5_text_list.txt" 31 | batch_size: 20 32 | } 33 | } 34 | layer { 35 | name: "data" 36 | type: "HDF5Data" 37 | top: "bbox_coordinate" 38 | top: "fc7_context" 39 | hdf5_data_param { 40 | source: "./data/training/train_bbox_context_hdf5_bbox_list.txt" 41 | batch_size: 50 42 | } 43 | } 44 | 45 | layer { 46 | name: "silence" 47 | type: "Silence" 48 | bottom: "label" 49 | } 50 | layer { 51 | name: "conv1_1" 52 | type: "Convolution" 53 | bottom: "data" 54 | top: "conv1_1" 55 | param { lr_mult: 0 } 56 | param { lr_mult: 0 decay_mult: 0 } 57 | include { stage: "freeze-convnet" } 58 | convolution_param { 59 | num_output: 64 60 | pad: 1 61 | kernel_size: 3 62 | } 63 | } 64 | layer { 65 | name: "conv1_1" 66 | type: "Convolution" 67 | bottom: "data" 68 | top: "conv1_1" 69 | param { lr_mult: 0.1 } 70 | param { lr_mult: 0.2 decay_mult: 0} 71 | exclude { stage: "freeze-convnet" } 72 | convolution_param { 73 | num_output: 64 74 | pad: 1 75 | kernel_size: 3 76 | } 77 | } 78 | layer { 79 | name: "relu1_1" 80 | type: "ReLU" 81 | bottom: "conv1_1" 82 | top: "conv1_1" 83 | } 84 | layer { 85 | name: "conv1_2" 86 | type: "Convolution" 87 | bottom: "conv1_1" 88 | top: "conv1_2" 89 | 
param { lr_mult: 0 } 90 | param { lr_mult: 0 decay_mult: 0 } 91 | include { stage: "freeze-convnet" } 92 | convolution_param { 93 | num_output: 64 94 | pad: 1 95 | kernel_size: 3 96 | } 97 | } 98 | layer { 99 | name: "conv1_2" 100 | type: "Convolution" 101 | bottom: "conv1_1" 102 | top: "conv1_2" 103 | param { lr_mult: 0.1 } 104 | param { lr_mult: 0.2 decay_mult: 0} 105 | exclude { stage: "freeze-convnet" } 106 | convolution_param { 107 | num_output: 64 108 | pad: 1 109 | kernel_size: 3 110 | } 111 | } 112 | layer { 113 | name: "relu1_2" 114 | type: "ReLU" 115 | bottom: "conv1_2" 116 | top: "conv1_2" 117 | } 118 | layer { 119 | name: "pool1" 120 | type: "Pooling" 121 | bottom: "conv1_2" 122 | top: "pool1" 123 | pooling_param { 124 | pool: MAX 125 | kernel_size: 2 126 | stride: 2 127 | } 128 | } 129 | layer { 130 | name: "conv2_1" 131 | type: "Convolution" 132 | bottom: "pool1" 133 | top: "conv2_1" 134 | param { lr_mult: 0 } 135 | param { lr_mult: 0 decay_mult: 0 } 136 | include { stage: "freeze-convnet" } 137 | convolution_param { 138 | num_output: 128 139 | pad: 1 140 | kernel_size: 3 141 | } 142 | } 143 | layer { 144 | name: "conv2_1" 145 | type: "Convolution" 146 | bottom: "pool1" 147 | top: "conv2_1" 148 | param { lr_mult: 0.1 } 149 | param { lr_mult: 0.2 decay_mult: 0} 150 | exclude { stage: "freeze-convnet" } 151 | convolution_param { 152 | num_output: 128 153 | pad: 1 154 | kernel_size: 3 155 | } 156 | } 157 | layer { 158 | name: "relu2_1" 159 | type: "ReLU" 160 | bottom: "conv2_1" 161 | top: "conv2_1" 162 | } 163 | layer { 164 | name: "conv2_2" 165 | type: "Convolution" 166 | bottom: "conv2_1" 167 | top: "conv2_2" 168 | param { lr_mult: 0 } 169 | param { lr_mult: 0 decay_mult: 0 } 170 | include { stage: "freeze-convnet" } 171 | convolution_param { 172 | num_output: 128 173 | pad: 1 174 | kernel_size: 3 175 | } 176 | } 177 | layer { 178 | name: "conv2_2" 179 | type: "Convolution" 180 | bottom: "conv2_1" 181 | top: "conv2_2" 182 | param { lr_mult: 0.1 } 183 | param { lr_mult: 0.2 decay_mult: 0} 184 | exclude { stage: "freeze-convnet" } 185 | convolution_param { 186 | num_output: 128 187 | pad: 1 188 | kernel_size: 3 189 | } 190 | } 191 | layer { 192 | name: "relu2_2" 193 | type: "ReLU" 194 | bottom: "conv2_2" 195 | top: "conv2_2" 196 | } 197 | layer { 198 | name: "pool2" 199 | type: "Pooling" 200 | bottom: "conv2_2" 201 | top: "pool2" 202 | pooling_param { 203 | pool: MAX 204 | kernel_size: 2 205 | stride: 2 206 | } 207 | } 208 | layer { 209 | name: "conv3_1" 210 | type: "Convolution" 211 | bottom: "pool2" 212 | top: "conv3_1" 213 | param { lr_mult: 0 } 214 | param { lr_mult: 0 decay_mult: 0 } 215 | include { stage: "freeze-convnet" } 216 | convolution_param { 217 | num_output: 256 218 | pad: 1 219 | kernel_size: 3 220 | } 221 | } 222 | layer { 223 | name: "conv3_1" 224 | type: "Convolution" 225 | bottom: "pool2" 226 | top: "conv3_1" 227 | param { lr_mult: 0.1 } 228 | param { lr_mult: 0.2 decay_mult: 0} 229 | exclude { stage: "freeze-convnet" } 230 | convolution_param { 231 | num_output: 256 232 | pad: 1 233 | kernel_size: 3 234 | } 235 | } 236 | layer { 237 | name: "relu3_1" 238 | type: "ReLU" 239 | bottom: "conv3_1" 240 | top: "conv3_1" 241 | } 242 | layer { 243 | name: "conv3_2" 244 | type: "Convolution" 245 | bottom: "conv3_1" 246 | top: "conv3_2" 247 | param { lr_mult: 0 } 248 | param { lr_mult: 0 decay_mult: 0 } 249 | include { stage: "freeze-convnet" } 250 | convolution_param { 251 | num_output: 256 252 | pad: 1 253 | kernel_size: 3 254 | } 255 | } 256 | layer { 257 | name: 
"conv3_2" 258 | type: "Convolution" 259 | bottom: "conv3_1" 260 | top: "conv3_2" 261 | param { lr_mult: 0.1 } 262 | param { lr_mult: 0.2 decay_mult: 0} 263 | exclude { stage: "freeze-convnet" } 264 | convolution_param { 265 | num_output: 256 266 | pad: 1 267 | kernel_size: 3 268 | } 269 | } 270 | layer { 271 | name: "relu3_2" 272 | type: "ReLU" 273 | bottom: "conv3_2" 274 | top: "conv3_2" 275 | } 276 | layer { 277 | name: "conv3_3" 278 | type: "Convolution" 279 | bottom: "conv3_2" 280 | top: "conv3_3" 281 | param { lr_mult: 0 } 282 | param { lr_mult: 0 decay_mult: 0 } 283 | include { stage: "freeze-convnet" } 284 | convolution_param { 285 | num_output: 256 286 | pad: 1 287 | kernel_size: 3 288 | } 289 | } 290 | layer { 291 | name: "conv3_3" 292 | type: "Convolution" 293 | bottom: "conv3_2" 294 | top: "conv3_3" 295 | param { lr_mult: 0.1 } 296 | param { lr_mult: 0.2 decay_mult: 0} 297 | exclude { stage: "freeze-convnet" } 298 | convolution_param { 299 | num_output: 256 300 | pad: 1 301 | kernel_size: 3 302 | } 303 | } 304 | layer { 305 | name: "relu3_3" 306 | type: "ReLU" 307 | bottom: "conv3_3" 308 | top: "conv3_3" 309 | } 310 | layer { 311 | name: "pool3" 312 | type: "Pooling" 313 | bottom: "conv3_3" 314 | top: "pool3" 315 | pooling_param { 316 | pool: MAX 317 | kernel_size: 2 318 | stride: 2 319 | } 320 | } 321 | layer { 322 | name: "conv4_1" 323 | type: "Convolution" 324 | bottom: "pool3" 325 | top: "conv4_1" 326 | param { lr_mult: 0 } 327 | param { lr_mult: 0 decay_mult: 0 } 328 | include { stage: "freeze-convnet" } 329 | convolution_param { 330 | num_output: 512 331 | pad: 1 332 | kernel_size: 3 333 | } 334 | } 335 | layer { 336 | name: "conv4_1" 337 | type: "Convolution" 338 | bottom: "pool3" 339 | top: "conv4_1" 340 | param { lr_mult: 0.1 } 341 | param { lr_mult: 0.2 decay_mult: 0} 342 | exclude { stage: "freeze-convnet" } 343 | convolution_param { 344 | num_output: 512 345 | pad: 1 346 | kernel_size: 3 347 | } 348 | } 349 | layer { 350 | name: "relu4_1" 351 | type: "ReLU" 352 | bottom: "conv4_1" 353 | top: "conv4_1" 354 | } 355 | layer { 356 | name: "conv4_2" 357 | type: "Convolution" 358 | bottom: "conv4_1" 359 | top: "conv4_2" 360 | param { lr_mult: 0 } 361 | param { lr_mult: 0 decay_mult: 0 } 362 | include { stage: "freeze-convnet" } 363 | convolution_param { 364 | num_output: 512 365 | pad: 1 366 | kernel_size: 3 367 | } 368 | } 369 | layer { 370 | name: "conv4_2" 371 | type: "Convolution" 372 | bottom: "conv4_1" 373 | top: "conv4_2" 374 | param { lr_mult: 0.1 } 375 | param { lr_mult: 0.2 decay_mult: 0} 376 | exclude { stage: "freeze-convnet" } 377 | convolution_param { 378 | num_output: 512 379 | pad: 1 380 | kernel_size: 3 381 | } 382 | } 383 | layer { 384 | name: "relu4_2" 385 | type: "ReLU" 386 | bottom: "conv4_2" 387 | top: "conv4_2" 388 | } 389 | layer { 390 | name: "conv4_3" 391 | type: "Convolution" 392 | bottom: "conv4_2" 393 | top: "conv4_3" 394 | param { lr_mult: 0 } 395 | param { lr_mult: 0 decay_mult: 0 } 396 | include { stage: "freeze-convnet" } 397 | convolution_param { 398 | num_output: 512 399 | pad: 1 400 | kernel_size: 3 401 | } 402 | } 403 | layer { 404 | name: "conv4_3" 405 | type: "Convolution" 406 | bottom: "conv4_2" 407 | top: "conv4_3" 408 | param { lr_mult: 0.1 } 409 | param { lr_mult: 0.2 decay_mult: 0} 410 | exclude { stage: "freeze-convnet" } 411 | convolution_param { 412 | num_output: 512 413 | pad: 1 414 | kernel_size: 3 415 | } 416 | } 417 | layer { 418 | name: "relu4_3" 419 | type: "ReLU" 420 | bottom: "conv4_3" 421 | top: "conv4_3" 422 | } 423 
| layer { 424 | name: "pool4" 425 | type: "Pooling" 426 | bottom: "conv4_3" 427 | top: "pool4" 428 | pooling_param { 429 | pool: MAX 430 | kernel_size: 2 431 | stride: 2 432 | } 433 | } 434 | layer { 435 | name: "conv5_1" 436 | type: "Convolution" 437 | bottom: "pool4" 438 | top: "conv5_1" 439 | param { lr_mult: 0 } 440 | param { lr_mult: 0 decay_mult: 0 } 441 | include { stage: "freeze-convnet" } 442 | convolution_param { 443 | num_output: 512 444 | pad: 1 445 | kernel_size: 3 446 | } 447 | } 448 | layer { 449 | name: "conv5_1" 450 | type: "Convolution" 451 | bottom: "pool4" 452 | top: "conv5_1" 453 | param { lr_mult: 0.1 } 454 | param { lr_mult: 0.2 decay_mult: 0} 455 | exclude { stage: "freeze-convnet" } 456 | convolution_param { 457 | num_output: 512 458 | pad: 1 459 | kernel_size: 3 460 | } 461 | } 462 | layer { 463 | name: "relu5_1" 464 | type: "ReLU" 465 | bottom: "conv5_1" 466 | top: "conv5_1" 467 | } 468 | layer { 469 | name: "conv5_2" 470 | type: "Convolution" 471 | bottom: "conv5_1" 472 | top: "conv5_2" 473 | param { lr_mult: 0 } 474 | param { lr_mult: 0 decay_mult: 0 } 475 | include { stage: "freeze-convnet" } 476 | convolution_param { 477 | num_output: 512 478 | pad: 1 479 | kernel_size: 3 480 | } 481 | } 482 | layer { 483 | name: "conv5_2" 484 | type: "Convolution" 485 | bottom: "conv5_1" 486 | top: "conv5_2" 487 | param { lr_mult: 0.1 } 488 | param { lr_mult: 0.2 decay_mult: 0} 489 | exclude { stage: "freeze-convnet" } 490 | convolution_param { 491 | num_output: 512 492 | pad: 1 493 | kernel_size: 3 494 | } 495 | } 496 | layer { 497 | name: "relu5_2" 498 | type: "ReLU" 499 | bottom: "conv5_2" 500 | top: "conv5_2" 501 | } 502 | layer { 503 | name: "conv5_3" 504 | type: "Convolution" 505 | bottom: "conv5_2" 506 | top: "conv5_3" 507 | param { lr_mult: 0 } 508 | param { lr_mult: 0 decay_mult: 0 } 509 | include { stage: "freeze-convnet" } 510 | convolution_param { 511 | num_output: 512 512 | pad: 1 513 | kernel_size: 3 514 | } 515 | } 516 | layer { 517 | name: "conv5_3" 518 | type: "Convolution" 519 | bottom: "conv5_2" 520 | top: "conv5_3" 521 | param { lr_mult: 0.1 } 522 | param { lr_mult: 0.2 decay_mult: 0} 523 | exclude { stage: "freeze-convnet" } 524 | convolution_param { 525 | num_output: 512 526 | pad: 1 527 | kernel_size: 3 528 | } 529 | } 530 | layer { 531 | name: "relu5_3" 532 | type: "ReLU" 533 | bottom: "conv5_3" 534 | top: "conv5_3" 535 | } 536 | layer { 537 | name: "pool5" 538 | type: "Pooling" 539 | bottom: "conv5_3" 540 | top: "pool5" 541 | pooling_param { 542 | pool: MAX 543 | kernel_size: 2 544 | stride: 2 545 | } 546 | } 547 | layer { 548 | name: "fc6" 549 | type: "InnerProduct" 550 | bottom: "pool5" 551 | top: "fc6" 552 | param { lr_mult: 0 } 553 | param { lr_mult: 0 decay_mult: 0 } 554 | include { stage: "freeze-convnet" } 555 | inner_product_param { 556 | num_output: 4096 557 | } 558 | } 559 | layer { 560 | name: "fc6" 561 | type: "InnerProduct" 562 | bottom: "pool5" 563 | top: "fc6" 564 | param { lr_mult: 0.1 } 565 | param { lr_mult: 0.2 decay_mult: 0} 566 | exclude { stage: "freeze-convnet" } 567 | inner_product_param { 568 | num_output: 4096 569 | } 570 | } 571 | layer { 572 | name: "relu6" 573 | type: "ReLU" 574 | bottom: "fc6" 575 | top: "fc6" 576 | } 577 | layer { 578 | name: "drop6" 579 | type: "Dropout" 580 | bottom: "fc6" 581 | top: "fc6" 582 | dropout_param { 583 | dropout_ratio: 0.5 584 | } 585 | } 586 | layer { 587 | name: "fc7" 588 | type: "InnerProduct" 589 | bottom: "fc6" 590 | top: "fc7" 591 | param { lr_mult: 0 } 592 | param { lr_mult: 0 
decay_mult: 0 } 593 | include { stage: "freeze-convnet" } 594 | inner_product_param { 595 | num_output: 4096 596 | } 597 | } 598 | layer { 599 | name: "fc7" 600 | type: "InnerProduct" 601 | bottom: "fc6" 602 | top: "fc7" 603 | param { lr_mult: 0.1 } 604 | param { lr_mult: 0.2 decay_mult: 0} 605 | exclude { stage: "freeze-convnet" } 606 | inner_product_param { 607 | num_output: 4096 608 | } 609 | } 610 | layer { 611 | name: "relu7" 612 | type: "ReLU" 613 | bottom: "fc7" 614 | top: "fc7" 615 | } 616 | layer { 617 | name: "drop7" 618 | type: "Dropout" 619 | bottom: "fc7" 620 | top: "fc7" 621 | dropout_param { 622 | dropout_ratio: 0.5 623 | } 624 | } 625 | layer { 626 | name: "fc8" 627 | type: "InnerProduct" 628 | bottom: "fc7" 629 | top: "fc8" 630 | param { 631 | lr_mult: 0.1 632 | decay_mult: 1 633 | } 634 | param { 635 | lr_mult: 0.2 636 | decay_mult: 0 637 | } 638 | inner_product_param { 639 | num_output: 1000 640 | } 641 | } 642 | layer { 643 | name: "local_features" 644 | type: "Concat" 645 | bottom: "fc8" 646 | bottom: "bbox_coordinate" 647 | top: "local_features" 648 | } 649 | 650 | layer { 651 | name: "embedding" 652 | type: "Embed" 653 | bottom: "input_sentence" 654 | top: "embedded_input_sentence" 655 | param { 656 | lr_mult: 1 657 | } 658 | embed_param { 659 | bias_term: false 660 | input_dim: 8801 661 | num_output: 1000 662 | weight_filler { 663 | type: "uniform" 664 | min: -0.08 665 | max: 0.08 666 | } 667 | } 668 | } 669 | layer { 670 | name: "lstm1" 671 | type: "LSTM" 672 | bottom: "embedded_input_sentence" 673 | bottom: "cont_sentence" 674 | bottom: "local_features" 675 | top: "lstm1" 676 | include { stage: "unfactored" } 677 | recurrent_param { 678 | num_output: 1000 679 | weight_filler { 680 | type: "uniform" 681 | min: -0.08 682 | max: 0.08 683 | } 684 | bias_filler { 685 | type: "constant" 686 | value: 0 687 | } 688 | } 689 | } 690 | layer { 691 | name: "lstm2" 692 | type: "LSTM" 693 | bottom: "lstm1" 694 | bottom: "cont_sentence" 695 | top: "lstm2" 696 | include { 697 | stage: "unfactored" 698 | stage: "2-layer" 699 | } 700 | recurrent_param { 701 | num_output: 1000 702 | weight_filler { 703 | type: "uniform" 704 | min: -0.08 705 | max: 0.08 706 | } 707 | bias_filler { 708 | type: "constant" 709 | value: 0 710 | } 711 | } 712 | } 713 | layer { 714 | name: "lstm1" 715 | type: "LSTM" 716 | bottom: "embedded_input_sentence" 717 | bottom: "cont_sentence" 718 | top: "lstm1" 719 | include { stage: "factored" } 720 | recurrent_param { 721 | num_output: 1000 722 | weight_filler { 723 | type: "uniform" 724 | min: -0.08 725 | max: 0.08 726 | } 727 | bias_filler { 728 | type: "constant" 729 | value: 0 730 | } 731 | } 732 | } 733 | layer { 734 | name: "lstm2-extended" 735 | type: "LSTM" 736 | bottom: "lstm1" 737 | bottom: "cont_sentence" 738 | bottom: "local_features" 739 | top: "lstm2" 740 | include { stage: "factored" } 741 | recurrent_param { 742 | num_output: 1000 743 | weight_filler { 744 | type: "uniform" 745 | min: -0.08 746 | max: 0.08 747 | } 748 | bias_filler { 749 | type: "constant" 750 | value: 0 751 | } 752 | } 753 | } 754 | layer { 755 | name: "predict" 756 | type: "InnerProduct" 757 | bottom: "lstm1" 758 | top: "predict" 759 | param { 760 | lr_mult: 1 761 | decay_mult: 1 762 | } 763 | param { 764 | lr_mult: 2 765 | decay_mult: 0 766 | } 767 | exclude { stage: "2-layer" } 768 | inner_product_param { 769 | num_output: 8801 770 | weight_filler { 771 | type: "uniform" 772 | min: -0.08 773 | max: 0.08 774 | } 775 | bias_filler { 776 | type: "constant" 777 | value: 0 778 | } 
779 | axis: 2 780 | } 781 | } 782 | layer { 783 | name: "predict" 784 | type: "InnerProduct" 785 | bottom: "lstm2" 786 | top: "predict" 787 | param { 788 | lr_mult: 1 789 | decay_mult: 1 790 | } 791 | param { 792 | lr_mult: 2 793 | decay_mult: 0 794 | } 795 | include { stage: "2-layer" } 796 | inner_product_param { 797 | num_output: 8801 798 | weight_filler { 799 | type: "uniform" 800 | min: -0.08 801 | max: 0.08 802 | } 803 | bias_filler { 804 | type: "constant" 805 | value: 0 806 | } 807 | axis: 2 808 | } 809 | } 810 | 811 | # Context LSTM 812 | layer { 813 | name: "fc8_context" 814 | type: "InnerProduct" 815 | bottom: "fc7_context" 816 | top: "fc8_context" 817 | param { 818 | lr_mult: 0.1 819 | decay_mult: 1 820 | } 821 | param { 822 | lr_mult: 0.2 823 | decay_mult: 0 824 | } 825 | inner_product_param { 826 | num_output: 1000 827 | } 828 | } 829 | layer { 830 | name: "lstm2_context" 831 | type: "LSTM" 832 | bottom: "lstm1" 833 | bottom: "cont_sentence" 834 | bottom: "fc8_context" 835 | top: "lstm2_context" 836 | include { stage: "factored" } 837 | recurrent_param { 838 | num_output: 1000 839 | weight_filler { 840 | type: "uniform" 841 | min: -0.08 842 | max: 0.08 843 | } 844 | bias_filler { 845 | type: "constant" 846 | value: 0 847 | } 848 | } 849 | } 850 | layer { 851 | name: "predict_context" 852 | type: "InnerProduct" 853 | bottom: "lstm2_context" 854 | top: "predict_context" 855 | param { 856 | lr_mult: 1 857 | decay_mult: 1 858 | } 859 | param { 860 | lr_mult: 2 861 | decay_mult: 0 862 | } 863 | include { stage: "2-layer" } 864 | inner_product_param { 865 | num_output: 8801 866 | weight_filler { 867 | type: "constant" 868 | value: 0 869 | } 870 | bias_filler { 871 | type: "constant" 872 | value: 0 873 | } 874 | axis: 2 875 | } 876 | } 877 | layer { 878 | name: "predict_combined" 879 | type: "Eltwise" 880 | bottom: "predict" 881 | bottom: "predict_context" 882 | top: "predict_combined" 883 | eltwise_param { operation: SUM } 884 | } 885 | 886 | layer { 887 | name: "cross_entropy_loss" 888 | type: "SoftmaxWithLoss" 889 | bottom: "predict_combined" 890 | bottom: "target_sentence" 891 | top: "cross_entropy_loss" 892 | loss_weight: 20 893 | loss_param { 894 | ignore_label: -1 895 | } 896 | softmax_param { 897 | axis: 2 898 | } 899 | } 900 | layer { 901 | name: "accuracy" 902 | type: "Accuracy" 903 | bottom: "predict" 904 | bottom: "target_sentence" 905 | top: "accuracy" 906 | include { phase: TEST } 907 | accuracy_param { 908 | axis: 2 909 | ignore_label: -1 910 | } 911 | } 912 | -------------------------------------------------------------------------------- /prototxt/scrc_full_vgg_solver.prototxt: -------------------------------------------------------------------------------- 1 | net: "./prototxt/scrc_full_vgg_buffer_50.prototxt" 2 | 3 | train_state: { stage: 'freeze-convnet' stage: 'factored' stage: '2-layer' } 4 | base_lr: 0.001 5 | lr_policy: "step" 6 | gamma: 0.5 7 | stepsize: 30000 8 | display: 1 9 | max_iter: 90000 10 | momentum: 0.9 11 | weight_decay: 0.0000 12 | snapshot: 10000 13 | snapshot_prefix: "./exp-referit/caffemodel/scrc_full_vgg" 14 | solver_mode: GPU 15 | random_seed: 1701 16 | average_loss: 100 17 | clip_gradients: 10 18 | -------------------------------------------------------------------------------- /prototxt/scrc_kitchen_buffer_50.prototxt: -------------------------------------------------------------------------------- 1 | state { 2 | phase: TRAIN level: 0 3 | stage: 'freeze-convnet' stage: 'factored' stage: '2-layer' 4 | } 5 | 6 | # train data layers 7 | 
layer { 8 | name: "data" 9 | type: "ImageData" 10 | top: "data" 11 | top: "label" 12 | transform_param { 13 | mirror: true 14 | crop_size: 224 15 | mean_value: 104 16 | mean_value: 117 17 | mean_value: 123 18 | } 19 | image_data_param { 20 | source: "./data/kitchen_train_image_list.txt" 21 | batch_size: 50 22 | new_height: 256 23 | new_width: 256 24 | } 25 | } 26 | layer { 27 | name: "data" 28 | type: "HDF5Data" 29 | top: "cont_sentence" 30 | top: "input_sentence" 31 | top: "target_sentence" 32 | hdf5_data_param { 33 | source: "./data/kitchen_train_hdf5_list.txt" 34 | batch_size: 20 35 | } 36 | } 37 | 38 | layer { 39 | name: "silence" 40 | type: "Silence" 41 | bottom: "label" 42 | } 43 | layer { 44 | name: "conv1_1" 45 | type: "Convolution" 46 | bottom: "data" 47 | top: "conv1_1" 48 | param { lr_mult: 0 } 49 | param { lr_mult: 0 decay_mult: 0 } 50 | include { stage: "freeze-convnet" } 51 | convolution_param { 52 | num_output: 64 53 | pad: 1 54 | kernel_size: 3 55 | } 56 | } 57 | layer { 58 | name: "conv1_1" 59 | type: "Convolution" 60 | bottom: "data" 61 | top: "conv1_1" 62 | param { lr_mult: 0.1 } 63 | param { lr_mult: 0.2 decay_mult: 0} 64 | exclude { stage: "freeze-convnet" } 65 | convolution_param { 66 | num_output: 64 67 | pad: 1 68 | kernel_size: 3 69 | } 70 | } 71 | layer { 72 | name: "relu1_1" 73 | type: "ReLU" 74 | bottom: "conv1_1" 75 | top: "conv1_1" 76 | } 77 | layer { 78 | name: "conv1_2" 79 | type: "Convolution" 80 | bottom: "conv1_1" 81 | top: "conv1_2" 82 | param { lr_mult: 0 } 83 | param { lr_mult: 0 decay_mult: 0 } 84 | include { stage: "freeze-convnet" } 85 | convolution_param { 86 | num_output: 64 87 | pad: 1 88 | kernel_size: 3 89 | } 90 | } 91 | layer { 92 | name: "conv1_2" 93 | type: "Convolution" 94 | bottom: "conv1_1" 95 | top: "conv1_2" 96 | param { lr_mult: 0.1 } 97 | param { lr_mult: 0.2 decay_mult: 0} 98 | exclude { stage: "freeze-convnet" } 99 | convolution_param { 100 | num_output: 64 101 | pad: 1 102 | kernel_size: 3 103 | } 104 | } 105 | layer { 106 | name: "relu1_2" 107 | type: "ReLU" 108 | bottom: "conv1_2" 109 | top: "conv1_2" 110 | } 111 | layer { 112 | name: "pool1" 113 | type: "Pooling" 114 | bottom: "conv1_2" 115 | top: "pool1" 116 | pooling_param { 117 | pool: MAX 118 | kernel_size: 2 119 | stride: 2 120 | } 121 | } 122 | layer { 123 | name: "conv2_1" 124 | type: "Convolution" 125 | bottom: "pool1" 126 | top: "conv2_1" 127 | param { lr_mult: 0 } 128 | param { lr_mult: 0 decay_mult: 0 } 129 | include { stage: "freeze-convnet" } 130 | convolution_param { 131 | num_output: 128 132 | pad: 1 133 | kernel_size: 3 134 | } 135 | } 136 | layer { 137 | name: "conv2_1" 138 | type: "Convolution" 139 | bottom: "pool1" 140 | top: "conv2_1" 141 | param { lr_mult: 0.1 } 142 | param { lr_mult: 0.2 decay_mult: 0} 143 | exclude { stage: "freeze-convnet" } 144 | convolution_param { 145 | num_output: 128 146 | pad: 1 147 | kernel_size: 3 148 | } 149 | } 150 | layer { 151 | name: "relu2_1" 152 | type: "ReLU" 153 | bottom: "conv2_1" 154 | top: "conv2_1" 155 | } 156 | layer { 157 | name: "conv2_2" 158 | type: "Convolution" 159 | bottom: "conv2_1" 160 | top: "conv2_2" 161 | param { lr_mult: 0 } 162 | param { lr_mult: 0 decay_mult: 0 } 163 | include { stage: "freeze-convnet" } 164 | convolution_param { 165 | num_output: 128 166 | pad: 1 167 | kernel_size: 3 168 | } 169 | } 170 | layer { 171 | name: "conv2_2" 172 | type: "Convolution" 173 | bottom: "conv2_1" 174 | top: "conv2_2" 175 | param { lr_mult: 0.1 } 176 | param { lr_mult: 0.2 decay_mult: 0} 177 | exclude { stage: 
"freeze-convnet" } 178 | convolution_param { 179 | num_output: 128 180 | pad: 1 181 | kernel_size: 3 182 | } 183 | } 184 | layer { 185 | name: "relu2_2" 186 | type: "ReLU" 187 | bottom: "conv2_2" 188 | top: "conv2_2" 189 | } 190 | layer { 191 | name: "pool2" 192 | type: "Pooling" 193 | bottom: "conv2_2" 194 | top: "pool2" 195 | pooling_param { 196 | pool: MAX 197 | kernel_size: 2 198 | stride: 2 199 | } 200 | } 201 | layer { 202 | name: "conv3_1" 203 | type: "Convolution" 204 | bottom: "pool2" 205 | top: "conv3_1" 206 | param { lr_mult: 0 } 207 | param { lr_mult: 0 decay_mult: 0 } 208 | include { stage: "freeze-convnet" } 209 | convolution_param { 210 | num_output: 256 211 | pad: 1 212 | kernel_size: 3 213 | } 214 | } 215 | layer { 216 | name: "conv3_1" 217 | type: "Convolution" 218 | bottom: "pool2" 219 | top: "conv3_1" 220 | param { lr_mult: 0.1 } 221 | param { lr_mult: 0.2 decay_mult: 0} 222 | exclude { stage: "freeze-convnet" } 223 | convolution_param { 224 | num_output: 256 225 | pad: 1 226 | kernel_size: 3 227 | } 228 | } 229 | layer { 230 | name: "relu3_1" 231 | type: "ReLU" 232 | bottom: "conv3_1" 233 | top: "conv3_1" 234 | } 235 | layer { 236 | name: "conv3_2" 237 | type: "Convolution" 238 | bottom: "conv3_1" 239 | top: "conv3_2" 240 | param { lr_mult: 0 } 241 | param { lr_mult: 0 decay_mult: 0 } 242 | include { stage: "freeze-convnet" } 243 | convolution_param { 244 | num_output: 256 245 | pad: 1 246 | kernel_size: 3 247 | } 248 | } 249 | layer { 250 | name: "conv3_2" 251 | type: "Convolution" 252 | bottom: "conv3_1" 253 | top: "conv3_2" 254 | param { lr_mult: 0.1 } 255 | param { lr_mult: 0.2 decay_mult: 0} 256 | exclude { stage: "freeze-convnet" } 257 | convolution_param { 258 | num_output: 256 259 | pad: 1 260 | kernel_size: 3 261 | } 262 | } 263 | layer { 264 | name: "relu3_2" 265 | type: "ReLU" 266 | bottom: "conv3_2" 267 | top: "conv3_2" 268 | } 269 | layer { 270 | name: "conv3_3" 271 | type: "Convolution" 272 | bottom: "conv3_2" 273 | top: "conv3_3" 274 | param { lr_mult: 0 } 275 | param { lr_mult: 0 decay_mult: 0 } 276 | include { stage: "freeze-convnet" } 277 | convolution_param { 278 | num_output: 256 279 | pad: 1 280 | kernel_size: 3 281 | } 282 | } 283 | layer { 284 | name: "conv3_3" 285 | type: "Convolution" 286 | bottom: "conv3_2" 287 | top: "conv3_3" 288 | param { lr_mult: 0.1 } 289 | param { lr_mult: 0.2 decay_mult: 0} 290 | exclude { stage: "freeze-convnet" } 291 | convolution_param { 292 | num_output: 256 293 | pad: 1 294 | kernel_size: 3 295 | } 296 | } 297 | layer { 298 | name: "relu3_3" 299 | type: "ReLU" 300 | bottom: "conv3_3" 301 | top: "conv3_3" 302 | } 303 | layer { 304 | name: "pool3" 305 | type: "Pooling" 306 | bottom: "conv3_3" 307 | top: "pool3" 308 | pooling_param { 309 | pool: MAX 310 | kernel_size: 2 311 | stride: 2 312 | } 313 | } 314 | layer { 315 | name: "conv4_1" 316 | type: "Convolution" 317 | bottom: "pool3" 318 | top: "conv4_1" 319 | param { lr_mult: 0 } 320 | param { lr_mult: 0 decay_mult: 0 } 321 | include { stage: "freeze-convnet" } 322 | convolution_param { 323 | num_output: 512 324 | pad: 1 325 | kernel_size: 3 326 | } 327 | } 328 | layer { 329 | name: "conv4_1" 330 | type: "Convolution" 331 | bottom: "pool3" 332 | top: "conv4_1" 333 | param { lr_mult: 0.1 } 334 | param { lr_mult: 0.2 decay_mult: 0} 335 | exclude { stage: "freeze-convnet" } 336 | convolution_param { 337 | num_output: 512 338 | pad: 1 339 | kernel_size: 3 340 | } 341 | } 342 | layer { 343 | name: "relu4_1" 344 | type: "ReLU" 345 | bottom: "conv4_1" 346 | top: "conv4_1" 
347 | } 348 | layer { 349 | name: "conv4_2" 350 | type: "Convolution" 351 | bottom: "conv4_1" 352 | top: "conv4_2" 353 | param { lr_mult: 0 } 354 | param { lr_mult: 0 decay_mult: 0 } 355 | include { stage: "freeze-convnet" } 356 | convolution_param { 357 | num_output: 512 358 | pad: 1 359 | kernel_size: 3 360 | } 361 | } 362 | layer { 363 | name: "conv4_2" 364 | type: "Convolution" 365 | bottom: "conv4_1" 366 | top: "conv4_2" 367 | param { lr_mult: 0.1 } 368 | param { lr_mult: 0.2 decay_mult: 0} 369 | exclude { stage: "freeze-convnet" } 370 | convolution_param { 371 | num_output: 512 372 | pad: 1 373 | kernel_size: 3 374 | } 375 | } 376 | layer { 377 | name: "relu4_2" 378 | type: "ReLU" 379 | bottom: "conv4_2" 380 | top: "conv4_2" 381 | } 382 | layer { 383 | name: "conv4_3" 384 | type: "Convolution" 385 | bottom: "conv4_2" 386 | top: "conv4_3" 387 | param { lr_mult: 0 } 388 | param { lr_mult: 0 decay_mult: 0 } 389 | include { stage: "freeze-convnet" } 390 | convolution_param { 391 | num_output: 512 392 | pad: 1 393 | kernel_size: 3 394 | } 395 | } 396 | layer { 397 | name: "conv4_3" 398 | type: "Convolution" 399 | bottom: "conv4_2" 400 | top: "conv4_3" 401 | param { lr_mult: 0.1 } 402 | param { lr_mult: 0.2 decay_mult: 0} 403 | exclude { stage: "freeze-convnet" } 404 | convolution_param { 405 | num_output: 512 406 | pad: 1 407 | kernel_size: 3 408 | } 409 | } 410 | layer { 411 | name: "relu4_3" 412 | type: "ReLU" 413 | bottom: "conv4_3" 414 | top: "conv4_3" 415 | } 416 | layer { 417 | name: "pool4" 418 | type: "Pooling" 419 | bottom: "conv4_3" 420 | top: "pool4" 421 | pooling_param { 422 | pool: MAX 423 | kernel_size: 2 424 | stride: 2 425 | } 426 | } 427 | layer { 428 | name: "conv5_1" 429 | type: "Convolution" 430 | bottom: "pool4" 431 | top: "conv5_1" 432 | param { lr_mult: 0 } 433 | param { lr_mult: 0 decay_mult: 0 } 434 | include { stage: "freeze-convnet" } 435 | convolution_param { 436 | num_output: 512 437 | pad: 1 438 | kernel_size: 3 439 | } 440 | } 441 | layer { 442 | name: "conv5_1" 443 | type: "Convolution" 444 | bottom: "pool4" 445 | top: "conv5_1" 446 | param { lr_mult: 0.1 } 447 | param { lr_mult: 0.2 decay_mult: 0} 448 | exclude { stage: "freeze-convnet" } 449 | convolution_param { 450 | num_output: 512 451 | pad: 1 452 | kernel_size: 3 453 | } 454 | } 455 | layer { 456 | name: "relu5_1" 457 | type: "ReLU" 458 | bottom: "conv5_1" 459 | top: "conv5_1" 460 | } 461 | layer { 462 | name: "conv5_2" 463 | type: "Convolution" 464 | bottom: "conv5_1" 465 | top: "conv5_2" 466 | param { lr_mult: 0 } 467 | param { lr_mult: 0 decay_mult: 0 } 468 | include { stage: "freeze-convnet" } 469 | convolution_param { 470 | num_output: 512 471 | pad: 1 472 | kernel_size: 3 473 | } 474 | } 475 | layer { 476 | name: "conv5_2" 477 | type: "Convolution" 478 | bottom: "conv5_1" 479 | top: "conv5_2" 480 | param { lr_mult: 0.1 } 481 | param { lr_mult: 0.2 decay_mult: 0} 482 | exclude { stage: "freeze-convnet" } 483 | convolution_param { 484 | num_output: 512 485 | pad: 1 486 | kernel_size: 3 487 | } 488 | } 489 | layer { 490 | name: "relu5_2" 491 | type: "ReLU" 492 | bottom: "conv5_2" 493 | top: "conv5_2" 494 | } 495 | layer { 496 | name: "conv5_3" 497 | type: "Convolution" 498 | bottom: "conv5_2" 499 | top: "conv5_3" 500 | param { lr_mult: 0 } 501 | param { lr_mult: 0 decay_mult: 0 } 502 | include { stage: "freeze-convnet" } 503 | convolution_param { 504 | num_output: 512 505 | pad: 1 506 | kernel_size: 3 507 | } 508 | } 509 | layer { 510 | name: "conv5_3" 511 | type: "Convolution" 512 | bottom: 
"conv5_2" 513 | top: "conv5_3" 514 | param { lr_mult: 0.1 } 515 | param { lr_mult: 0.2 decay_mult: 0} 516 | exclude { stage: "freeze-convnet" } 517 | convolution_param { 518 | num_output: 512 519 | pad: 1 520 | kernel_size: 3 521 | } 522 | } 523 | layer { 524 | name: "relu5_3" 525 | type: "ReLU" 526 | bottom: "conv5_3" 527 | top: "conv5_3" 528 | } 529 | layer { 530 | name: "pool5" 531 | type: "Pooling" 532 | bottom: "conv5_3" 533 | top: "pool5" 534 | pooling_param { 535 | pool: MAX 536 | kernel_size: 2 537 | stride: 2 538 | } 539 | } 540 | layer { 541 | name: "fc6" 542 | type: "InnerProduct" 543 | bottom: "pool5" 544 | top: "fc6" 545 | param { lr_mult: 0 } 546 | param { lr_mult: 0 decay_mult: 0 } 547 | include { stage: "freeze-convnet" } 548 | inner_product_param { 549 | num_output: 4096 550 | } 551 | } 552 | layer { 553 | name: "fc6" 554 | type: "InnerProduct" 555 | bottom: "pool5" 556 | top: "fc6" 557 | param { lr_mult: 0.1 } 558 | param { lr_mult: 0.2 decay_mult: 0} 559 | exclude { stage: "freeze-convnet" } 560 | inner_product_param { 561 | num_output: 4096 562 | } 563 | } 564 | layer { 565 | name: "relu6" 566 | type: "ReLU" 567 | bottom: "fc6" 568 | top: "fc6" 569 | } 570 | layer { 571 | name: "drop6" 572 | type: "Dropout" 573 | bottom: "fc6" 574 | top: "fc6" 575 | dropout_param { 576 | dropout_ratio: 0.5 577 | } 578 | } 579 | layer { 580 | name: "fc7" 581 | type: "InnerProduct" 582 | bottom: "fc6" 583 | top: "fc7" 584 | param { lr_mult: 0 } 585 | param { lr_mult: 0 decay_mult: 0 } 586 | include { stage: "freeze-convnet" } 587 | inner_product_param { 588 | num_output: 4096 589 | } 590 | } 591 | layer { 592 | name: "fc7" 593 | type: "InnerProduct" 594 | bottom: "fc6" 595 | top: "fc7" 596 | param { lr_mult: 0.1 } 597 | param { lr_mult: 0.2 decay_mult: 0} 598 | exclude { stage: "freeze-convnet" } 599 | inner_product_param { 600 | num_output: 4096 601 | } 602 | } 603 | layer { 604 | name: "relu7" 605 | type: "ReLU" 606 | bottom: "fc7" 607 | top: "fc7" 608 | } 609 | layer { 610 | name: "drop7" 611 | type: "Dropout" 612 | bottom: "fc7" 613 | top: "fc7" 614 | dropout_param { 615 | dropout_ratio: 0.5 616 | } 617 | } 618 | layer { 619 | name: "fc8" 620 | type: "InnerProduct" 621 | bottom: "fc7" 622 | top: "fc8" 623 | param { 624 | lr_mult: 0.1 625 | decay_mult: 1 626 | } 627 | param { 628 | lr_mult: 0.2 629 | decay_mult: 0 630 | } 631 | inner_product_param { 632 | num_output: 1000 633 | } 634 | } 635 | layer { 636 | name: "embedding" 637 | type: "Embed" 638 | bottom: "input_sentence" 639 | top: "embedded_input_sentence" 640 | param { 641 | lr_mult: 1 642 | } 643 | embed_param { 644 | bias_term: false 645 | input_dim: 8801 646 | num_output: 1000 647 | weight_filler { 648 | type: "uniform" 649 | min: -0.08 650 | max: 0.08 651 | } 652 | } 653 | } 654 | layer { 655 | name: "lstm1" 656 | type: "LSTM" 657 | bottom: "embedded_input_sentence" 658 | bottom: "cont_sentence" 659 | bottom: "fc8" 660 | top: "lstm1" 661 | include { stage: "unfactored" } 662 | recurrent_param { 663 | num_output: 1000 664 | weight_filler { 665 | type: "uniform" 666 | min: -0.08 667 | max: 0.08 668 | } 669 | bias_filler { 670 | type: "constant" 671 | value: 0 672 | } 673 | } 674 | } 675 | layer { 676 | name: "lstm2" 677 | type: "LSTM" 678 | bottom: "lstm1" 679 | bottom: "cont_sentence" 680 | top: "lstm2" 681 | include { 682 | stage: "unfactored" 683 | stage: "2-layer" 684 | } 685 | recurrent_param { 686 | num_output: 1000 687 | weight_filler { 688 | type: "uniform" 689 | min: -0.08 690 | max: 0.08 691 | } 692 | bias_filler { 
693 | type: "constant" 694 | value: 0 695 | } 696 | } 697 | } 698 | layer { 699 | name: "lstm1" 700 | type: "LSTM" 701 | bottom: "embedded_input_sentence" 702 | bottom: "cont_sentence" 703 | top: "lstm1" 704 | include { stage: "factored" } 705 | recurrent_param { 706 | num_output: 1000 707 | weight_filler { 708 | type: "uniform" 709 | min: -0.08 710 | max: 0.08 711 | } 712 | bias_filler { 713 | type: "constant" 714 | value: 0 715 | } 716 | } 717 | } 718 | layer { 719 | name: "lstm2" 720 | type: "LSTM" 721 | bottom: "lstm1" 722 | bottom: "cont_sentence" 723 | bottom: "fc8" 724 | top: "lstm2" 725 | include { stage: "factored" } 726 | recurrent_param { 727 | num_output: 1000 728 | weight_filler { 729 | type: "uniform" 730 | min: -0.08 731 | max: 0.08 732 | } 733 | bias_filler { 734 | type: "constant" 735 | value: 0 736 | } 737 | } 738 | } 739 | layer { 740 | name: "predict" 741 | type: "InnerProduct" 742 | bottom: "lstm1" 743 | top: "predict" 744 | param { 745 | lr_mult: 1 746 | decay_mult: 1 747 | } 748 | param { 749 | lr_mult: 2 750 | decay_mult: 0 751 | } 752 | exclude { stage: "2-layer" } 753 | inner_product_param { 754 | num_output: 8801 755 | weight_filler { 756 | type: "uniform" 757 | min: -0.08 758 | max: 0.08 759 | } 760 | bias_filler { 761 | type: "constant" 762 | value: 0 763 | } 764 | axis: 2 765 | } 766 | } 767 | layer { 768 | name: "predict" 769 | type: "InnerProduct" 770 | bottom: "lstm2" 771 | top: "predict" 772 | param { 773 | lr_mult: 1 774 | decay_mult: 1 775 | } 776 | param { 777 | lr_mult: 2 778 | decay_mult: 0 779 | } 780 | include { stage: "2-layer" } 781 | inner_product_param { 782 | num_output: 8801 783 | weight_filler { 784 | type: "uniform" 785 | min: -0.08 786 | max: 0.08 787 | } 788 | bias_filler { 789 | type: "constant" 790 | value: 0 791 | } 792 | axis: 2 793 | } 794 | } 795 | layer { 796 | name: "cross_entropy_loss" 797 | type: "SoftmaxWithLoss" 798 | bottom: "predict" 799 | bottom: "target_sentence" 800 | top: "cross_entropy_loss" 801 | loss_weight: 20 802 | loss_param { 803 | ignore_label: -1 804 | } 805 | softmax_param { 806 | axis: 2 807 | } 808 | } 809 | layer { 810 | name: "accuracy" 811 | type: "Accuracy" 812 | bottom: "predict" 813 | bottom: "target_sentence" 814 | top: "accuracy" 815 | include { phase: TEST } 816 | accuracy_param { 817 | axis: 2 818 | ignore_label: -1 819 | } 820 | } 821 | -------------------------------------------------------------------------------- /prototxt/scrc_kitchen_solver.prototxt: -------------------------------------------------------------------------------- 1 | net: "prototxt/scrc_kitchen_buffer_50.prototxt" 2 | 3 | train_state: { stage: 'freeze-convnet' stage: 'factored' stage: '2-layer' } 4 | base_lr: 0.001 5 | lr_policy: "step" 6 | gamma: 0.5 7 | stepsize: 1000 8 | display: 1 9 | max_iter: 3000 10 | momentum: 0.9 11 | weight_decay: 0.0000 12 | snapshot: 3000 13 | snapshot_prefix: "./exp-kitchen/caffemodel/scrc_kitchen" 14 | solver_mode: GPU 15 | random_seed: 1701 16 | average_loss: 100 17 | clip_gradients: 10 18 | -------------------------------------------------------------------------------- /prototxt/scrc_no_context_vgg_buffer_50.prototxt: -------------------------------------------------------------------------------- 1 | state { 2 | phase: TRAIN level: 0 3 | stage: 'freeze-convnet' stage: 'factored' stage: '2-layer' 4 | } 5 | 6 | # train data layers 7 | layer { 8 | name: "data" 9 | type: "ImageData" 10 | top: "data" 11 | top: "label" 12 | transform_param { 13 | crop_size: 224 14 | mean_value: 104 15 | mean_value: 
117 16 | mean_value: 123 17 | } 18 | image_data_param { 19 | source: "./data/training/train_bbox_context_imcrop_list.txt" 20 | batch_size: 50 21 | } 22 | } 23 | layer { 24 | name: "data" 25 | type: "HDF5Data" 26 | top: "cont_sentence" 27 | top: "input_sentence" 28 | top: "target_sentence" 29 | hdf5_data_param { 30 | source: "./data/training/train_bbox_context_hdf5_text_list.txt" 31 | batch_size: 20 32 | } 33 | } 34 | layer { 35 | name: "data" 36 | type: "HDF5Data" 37 | top: "bbox_coordinate" 38 | hdf5_data_param { 39 | source: "./data/training/train_bbox_context_hdf5_bbox_list.txt" 40 | batch_size: 50 41 | } 42 | } 43 | 44 | layer { 45 | name: "silence" 46 | type: "Silence" 47 | bottom: "label" 48 | } 49 | layer { 50 | name: "conv1_1" 51 | type: "Convolution" 52 | bottom: "data" 53 | top: "conv1_1" 54 | param { lr_mult: 0 } 55 | param { lr_mult: 0 decay_mult: 0 } 56 | include { stage: "freeze-convnet" } 57 | convolution_param { 58 | num_output: 64 59 | pad: 1 60 | kernel_size: 3 61 | } 62 | } 63 | layer { 64 | name: "conv1_1" 65 | type: "Convolution" 66 | bottom: "data" 67 | top: "conv1_1" 68 | param { lr_mult: 0.1 } 69 | param { lr_mult: 0.2 decay_mult: 0} 70 | exclude { stage: "freeze-convnet" } 71 | convolution_param { 72 | num_output: 64 73 | pad: 1 74 | kernel_size: 3 75 | } 76 | } 77 | layer { 78 | name: "relu1_1" 79 | type: "ReLU" 80 | bottom: "conv1_1" 81 | top: "conv1_1" 82 | } 83 | layer { 84 | name: "conv1_2" 85 | type: "Convolution" 86 | bottom: "conv1_1" 87 | top: "conv1_2" 88 | param { lr_mult: 0 } 89 | param { lr_mult: 0 decay_mult: 0 } 90 | include { stage: "freeze-convnet" } 91 | convolution_param { 92 | num_output: 64 93 | pad: 1 94 | kernel_size: 3 95 | } 96 | } 97 | layer { 98 | name: "conv1_2" 99 | type: "Convolution" 100 | bottom: "conv1_1" 101 | top: "conv1_2" 102 | param { lr_mult: 0.1 } 103 | param { lr_mult: 0.2 decay_mult: 0} 104 | exclude { stage: "freeze-convnet" } 105 | convolution_param { 106 | num_output: 64 107 | pad: 1 108 | kernel_size: 3 109 | } 110 | } 111 | layer { 112 | name: "relu1_2" 113 | type: "ReLU" 114 | bottom: "conv1_2" 115 | top: "conv1_2" 116 | } 117 | layer { 118 | name: "pool1" 119 | type: "Pooling" 120 | bottom: "conv1_2" 121 | top: "pool1" 122 | pooling_param { 123 | pool: MAX 124 | kernel_size: 2 125 | stride: 2 126 | } 127 | } 128 | layer { 129 | name: "conv2_1" 130 | type: "Convolution" 131 | bottom: "pool1" 132 | top: "conv2_1" 133 | param { lr_mult: 0 } 134 | param { lr_mult: 0 decay_mult: 0 } 135 | include { stage: "freeze-convnet" } 136 | convolution_param { 137 | num_output: 128 138 | pad: 1 139 | kernel_size: 3 140 | } 141 | } 142 | layer { 143 | name: "conv2_1" 144 | type: "Convolution" 145 | bottom: "pool1" 146 | top: "conv2_1" 147 | param { lr_mult: 0.1 } 148 | param { lr_mult: 0.2 decay_mult: 0} 149 | exclude { stage: "freeze-convnet" } 150 | convolution_param { 151 | num_output: 128 152 | pad: 1 153 | kernel_size: 3 154 | } 155 | } 156 | layer { 157 | name: "relu2_1" 158 | type: "ReLU" 159 | bottom: "conv2_1" 160 | top: "conv2_1" 161 | } 162 | layer { 163 | name: "conv2_2" 164 | type: "Convolution" 165 | bottom: "conv2_1" 166 | top: "conv2_2" 167 | param { lr_mult: 0 } 168 | param { lr_mult: 0 decay_mult: 0 } 169 | include { stage: "freeze-convnet" } 170 | convolution_param { 171 | num_output: 128 172 | pad: 1 173 | kernel_size: 3 174 | } 175 | } 176 | layer { 177 | name: "conv2_2" 178 | type: "Convolution" 179 | bottom: "conv2_1" 180 | top: "conv2_2" 181 | param { lr_mult: 0.1 } 182 | param { lr_mult: 0.2 decay_mult: 0} 
183 | exclude { stage: "freeze-convnet" } 184 | convolution_param { 185 | num_output: 128 186 | pad: 1 187 | kernel_size: 3 188 | } 189 | } 190 | layer { 191 | name: "relu2_2" 192 | type: "ReLU" 193 | bottom: "conv2_2" 194 | top: "conv2_2" 195 | } 196 | layer { 197 | name: "pool2" 198 | type: "Pooling" 199 | bottom: "conv2_2" 200 | top: "pool2" 201 | pooling_param { 202 | pool: MAX 203 | kernel_size: 2 204 | stride: 2 205 | } 206 | } 207 | layer { 208 | name: "conv3_1" 209 | type: "Convolution" 210 | bottom: "pool2" 211 | top: "conv3_1" 212 | param { lr_mult: 0 } 213 | param { lr_mult: 0 decay_mult: 0 } 214 | include { stage: "freeze-convnet" } 215 | convolution_param { 216 | num_output: 256 217 | pad: 1 218 | kernel_size: 3 219 | } 220 | } 221 | layer { 222 | name: "conv3_1" 223 | type: "Convolution" 224 | bottom: "pool2" 225 | top: "conv3_1" 226 | param { lr_mult: 0.1 } 227 | param { lr_mult: 0.2 decay_mult: 0} 228 | exclude { stage: "freeze-convnet" } 229 | convolution_param { 230 | num_output: 256 231 | pad: 1 232 | kernel_size: 3 233 | } 234 | } 235 | layer { 236 | name: "relu3_1" 237 | type: "ReLU" 238 | bottom: "conv3_1" 239 | top: "conv3_1" 240 | } 241 | layer { 242 | name: "conv3_2" 243 | type: "Convolution" 244 | bottom: "conv3_1" 245 | top: "conv3_2" 246 | param { lr_mult: 0 } 247 | param { lr_mult: 0 decay_mult: 0 } 248 | include { stage: "freeze-convnet" } 249 | convolution_param { 250 | num_output: 256 251 | pad: 1 252 | kernel_size: 3 253 | } 254 | } 255 | layer { 256 | name: "conv3_2" 257 | type: "Convolution" 258 | bottom: "conv3_1" 259 | top: "conv3_2" 260 | param { lr_mult: 0.1 } 261 | param { lr_mult: 0.2 decay_mult: 0} 262 | exclude { stage: "freeze-convnet" } 263 | convolution_param { 264 | num_output: 256 265 | pad: 1 266 | kernel_size: 3 267 | } 268 | } 269 | layer { 270 | name: "relu3_2" 271 | type: "ReLU" 272 | bottom: "conv3_2" 273 | top: "conv3_2" 274 | } 275 | layer { 276 | name: "conv3_3" 277 | type: "Convolution" 278 | bottom: "conv3_2" 279 | top: "conv3_3" 280 | param { lr_mult: 0 } 281 | param { lr_mult: 0 decay_mult: 0 } 282 | include { stage: "freeze-convnet" } 283 | convolution_param { 284 | num_output: 256 285 | pad: 1 286 | kernel_size: 3 287 | } 288 | } 289 | layer { 290 | name: "conv3_3" 291 | type: "Convolution" 292 | bottom: "conv3_2" 293 | top: "conv3_3" 294 | param { lr_mult: 0.1 } 295 | param { lr_mult: 0.2 decay_mult: 0} 296 | exclude { stage: "freeze-convnet" } 297 | convolution_param { 298 | num_output: 256 299 | pad: 1 300 | kernel_size: 3 301 | } 302 | } 303 | layer { 304 | name: "relu3_3" 305 | type: "ReLU" 306 | bottom: "conv3_3" 307 | top: "conv3_3" 308 | } 309 | layer { 310 | name: "pool3" 311 | type: "Pooling" 312 | bottom: "conv3_3" 313 | top: "pool3" 314 | pooling_param { 315 | pool: MAX 316 | kernel_size: 2 317 | stride: 2 318 | } 319 | } 320 | layer { 321 | name: "conv4_1" 322 | type: "Convolution" 323 | bottom: "pool3" 324 | top: "conv4_1" 325 | param { lr_mult: 0 } 326 | param { lr_mult: 0 decay_mult: 0 } 327 | include { stage: "freeze-convnet" } 328 | convolution_param { 329 | num_output: 512 330 | pad: 1 331 | kernel_size: 3 332 | } 333 | } 334 | layer { 335 | name: "conv4_1" 336 | type: "Convolution" 337 | bottom: "pool3" 338 | top: "conv4_1" 339 | param { lr_mult: 0.1 } 340 | param { lr_mult: 0.2 decay_mult: 0} 341 | exclude { stage: "freeze-convnet" } 342 | convolution_param { 343 | num_output: 512 344 | pad: 1 345 | kernel_size: 3 346 | } 347 | } 348 | layer { 349 | name: "relu4_1" 350 | type: "ReLU" 351 | bottom: "conv4_1" 
352 | top: "conv4_1" 353 | } 354 | layer { 355 | name: "conv4_2" 356 | type: "Convolution" 357 | bottom: "conv4_1" 358 | top: "conv4_2" 359 | param { lr_mult: 0 } 360 | param { lr_mult: 0 decay_mult: 0 } 361 | include { stage: "freeze-convnet" } 362 | convolution_param { 363 | num_output: 512 364 | pad: 1 365 | kernel_size: 3 366 | } 367 | } 368 | layer { 369 | name: "conv4_2" 370 | type: "Convolution" 371 | bottom: "conv4_1" 372 | top: "conv4_2" 373 | param { lr_mult: 0.1 } 374 | param { lr_mult: 0.2 decay_mult: 0} 375 | exclude { stage: "freeze-convnet" } 376 | convolution_param { 377 | num_output: 512 378 | pad: 1 379 | kernel_size: 3 380 | } 381 | } 382 | layer { 383 | name: "relu4_2" 384 | type: "ReLU" 385 | bottom: "conv4_2" 386 | top: "conv4_2" 387 | } 388 | layer { 389 | name: "conv4_3" 390 | type: "Convolution" 391 | bottom: "conv4_2" 392 | top: "conv4_3" 393 | param { lr_mult: 0 } 394 | param { lr_mult: 0 decay_mult: 0 } 395 | include { stage: "freeze-convnet" } 396 | convolution_param { 397 | num_output: 512 398 | pad: 1 399 | kernel_size: 3 400 | } 401 | } 402 | layer { 403 | name: "conv4_3" 404 | type: "Convolution" 405 | bottom: "conv4_2" 406 | top: "conv4_3" 407 | param { lr_mult: 0.1 } 408 | param { lr_mult: 0.2 decay_mult: 0} 409 | exclude { stage: "freeze-convnet" } 410 | convolution_param { 411 | num_output: 512 412 | pad: 1 413 | kernel_size: 3 414 | } 415 | } 416 | layer { 417 | name: "relu4_3" 418 | type: "ReLU" 419 | bottom: "conv4_3" 420 | top: "conv4_3" 421 | } 422 | layer { 423 | name: "pool4" 424 | type: "Pooling" 425 | bottom: "conv4_3" 426 | top: "pool4" 427 | pooling_param { 428 | pool: MAX 429 | kernel_size: 2 430 | stride: 2 431 | } 432 | } 433 | layer { 434 | name: "conv5_1" 435 | type: "Convolution" 436 | bottom: "pool4" 437 | top: "conv5_1" 438 | param { lr_mult: 0 } 439 | param { lr_mult: 0 decay_mult: 0 } 440 | include { stage: "freeze-convnet" } 441 | convolution_param { 442 | num_output: 512 443 | pad: 1 444 | kernel_size: 3 445 | } 446 | } 447 | layer { 448 | name: "conv5_1" 449 | type: "Convolution" 450 | bottom: "pool4" 451 | top: "conv5_1" 452 | param { lr_mult: 0.1 } 453 | param { lr_mult: 0.2 decay_mult: 0} 454 | exclude { stage: "freeze-convnet" } 455 | convolution_param { 456 | num_output: 512 457 | pad: 1 458 | kernel_size: 3 459 | } 460 | } 461 | layer { 462 | name: "relu5_1" 463 | type: "ReLU" 464 | bottom: "conv5_1" 465 | top: "conv5_1" 466 | } 467 | layer { 468 | name: "conv5_2" 469 | type: "Convolution" 470 | bottom: "conv5_1" 471 | top: "conv5_2" 472 | param { lr_mult: 0 } 473 | param { lr_mult: 0 decay_mult: 0 } 474 | include { stage: "freeze-convnet" } 475 | convolution_param { 476 | num_output: 512 477 | pad: 1 478 | kernel_size: 3 479 | } 480 | } 481 | layer { 482 | name: "conv5_2" 483 | type: "Convolution" 484 | bottom: "conv5_1" 485 | top: "conv5_2" 486 | param { lr_mult: 0.1 } 487 | param { lr_mult: 0.2 decay_mult: 0} 488 | exclude { stage: "freeze-convnet" } 489 | convolution_param { 490 | num_output: 512 491 | pad: 1 492 | kernel_size: 3 493 | } 494 | } 495 | layer { 496 | name: "relu5_2" 497 | type: "ReLU" 498 | bottom: "conv5_2" 499 | top: "conv5_2" 500 | } 501 | layer { 502 | name: "conv5_3" 503 | type: "Convolution" 504 | bottom: "conv5_2" 505 | top: "conv5_3" 506 | param { lr_mult: 0 } 507 | param { lr_mult: 0 decay_mult: 0 } 508 | include { stage: "freeze-convnet" } 509 | convolution_param { 510 | num_output: 512 511 | pad: 1 512 | kernel_size: 3 513 | } 514 | } 515 | layer { 516 | name: "conv5_3" 517 | type: 
"Convolution" 518 | bottom: "conv5_2" 519 | top: "conv5_3" 520 | param { lr_mult: 0.1 } 521 | param { lr_mult: 0.2 decay_mult: 0} 522 | exclude { stage: "freeze-convnet" } 523 | convolution_param { 524 | num_output: 512 525 | pad: 1 526 | kernel_size: 3 527 | } 528 | } 529 | layer { 530 | name: "relu5_3" 531 | type: "ReLU" 532 | bottom: "conv5_3" 533 | top: "conv5_3" 534 | } 535 | layer { 536 | name: "pool5" 537 | type: "Pooling" 538 | bottom: "conv5_3" 539 | top: "pool5" 540 | pooling_param { 541 | pool: MAX 542 | kernel_size: 2 543 | stride: 2 544 | } 545 | } 546 | layer { 547 | name: "fc6" 548 | type: "InnerProduct" 549 | bottom: "pool5" 550 | top: "fc6" 551 | param { lr_mult: 0 } 552 | param { lr_mult: 0 decay_mult: 0 } 553 | include { stage: "freeze-convnet" } 554 | inner_product_param { 555 | num_output: 4096 556 | } 557 | } 558 | layer { 559 | name: "fc6" 560 | type: "InnerProduct" 561 | bottom: "pool5" 562 | top: "fc6" 563 | param { lr_mult: 0.1 } 564 | param { lr_mult: 0.2 decay_mult: 0} 565 | exclude { stage: "freeze-convnet" } 566 | inner_product_param { 567 | num_output: 4096 568 | } 569 | } 570 | layer { 571 | name: "relu6" 572 | type: "ReLU" 573 | bottom: "fc6" 574 | top: "fc6" 575 | } 576 | layer { 577 | name: "drop6" 578 | type: "Dropout" 579 | bottom: "fc6" 580 | top: "fc6" 581 | dropout_param { 582 | dropout_ratio: 0.5 583 | } 584 | } 585 | layer { 586 | name: "fc7" 587 | type: "InnerProduct" 588 | bottom: "fc6" 589 | top: "fc7" 590 | param { lr_mult: 0 } 591 | param { lr_mult: 0 decay_mult: 0 } 592 | include { stage: "freeze-convnet" } 593 | inner_product_param { 594 | num_output: 4096 595 | } 596 | } 597 | layer { 598 | name: "fc7" 599 | type: "InnerProduct" 600 | bottom: "fc6" 601 | top: "fc7" 602 | param { lr_mult: 0.1 } 603 | param { lr_mult: 0.2 decay_mult: 0} 604 | exclude { stage: "freeze-convnet" } 605 | inner_product_param { 606 | num_output: 4096 607 | } 608 | } 609 | layer { 610 | name: "relu7" 611 | type: "ReLU" 612 | bottom: "fc7" 613 | top: "fc7" 614 | } 615 | layer { 616 | name: "drop7" 617 | type: "Dropout" 618 | bottom: "fc7" 619 | top: "fc7" 620 | dropout_param { 621 | dropout_ratio: 0.5 622 | } 623 | } 624 | layer { 625 | name: "fc8" 626 | type: "InnerProduct" 627 | bottom: "fc7" 628 | top: "fc8" 629 | param { 630 | lr_mult: 0.1 631 | decay_mult: 1 632 | } 633 | param { 634 | lr_mult: 0.2 635 | decay_mult: 0 636 | } 637 | inner_product_param { 638 | num_output: 1000 639 | } 640 | } 641 | layer { 642 | name: "local_features" 643 | type: "Concat" 644 | bottom: "fc8" 645 | bottom: "bbox_coordinate" 646 | top: "local_features" 647 | } 648 | 649 | layer { 650 | name: "embedding" 651 | type: "Embed" 652 | bottom: "input_sentence" 653 | top: "embedded_input_sentence" 654 | param { 655 | lr_mult: 1 656 | } 657 | embed_param { 658 | bias_term: false 659 | input_dim: 8801 660 | num_output: 1000 661 | weight_filler { 662 | type: "uniform" 663 | min: -0.08 664 | max: 0.08 665 | } 666 | } 667 | } 668 | layer { 669 | name: "lstm1" 670 | type: "LSTM" 671 | bottom: "embedded_input_sentence" 672 | bottom: "cont_sentence" 673 | bottom: "local_features" 674 | top: "lstm1" 675 | include { stage: "unfactored" } 676 | recurrent_param { 677 | num_output: 1000 678 | weight_filler { 679 | type: "uniform" 680 | min: -0.08 681 | max: 0.08 682 | } 683 | bias_filler { 684 | type: "constant" 685 | value: 0 686 | } 687 | } 688 | } 689 | layer { 690 | name: "lstm2" 691 | type: "LSTM" 692 | bottom: "lstm1" 693 | bottom: "cont_sentence" 694 | top: "lstm2" 695 | include { 696 | stage: 
"unfactored" 697 | stage: "2-layer" 698 | } 699 | recurrent_param { 700 | num_output: 1000 701 | weight_filler { 702 | type: "uniform" 703 | min: -0.08 704 | max: 0.08 705 | } 706 | bias_filler { 707 | type: "constant" 708 | value: 0 709 | } 710 | } 711 | } 712 | layer { 713 | name: "lstm1" 714 | type: "LSTM" 715 | bottom: "embedded_input_sentence" 716 | bottom: "cont_sentence" 717 | top: "lstm1" 718 | include { stage: "factored" } 719 | recurrent_param { 720 | num_output: 1000 721 | weight_filler { 722 | type: "uniform" 723 | min: -0.08 724 | max: 0.08 725 | } 726 | bias_filler { 727 | type: "constant" 728 | value: 0 729 | } 730 | } 731 | } 732 | layer { 733 | name: "lstm2-extended" 734 | type: "LSTM" 735 | bottom: "lstm1" 736 | bottom: "cont_sentence" 737 | bottom: "local_features" 738 | top: "lstm2" 739 | include { stage: "factored" } 740 | recurrent_param { 741 | num_output: 1000 742 | weight_filler { 743 | type: "uniform" 744 | min: -0.08 745 | max: 0.08 746 | } 747 | bias_filler { 748 | type: "constant" 749 | value: 0 750 | } 751 | } 752 | } 753 | layer { 754 | name: "predict" 755 | type: "InnerProduct" 756 | bottom: "lstm1" 757 | top: "predict" 758 | param { 759 | lr_mult: 1 760 | decay_mult: 1 761 | } 762 | param { 763 | lr_mult: 2 764 | decay_mult: 0 765 | } 766 | exclude { stage: "2-layer" } 767 | inner_product_param { 768 | num_output: 8801 769 | weight_filler { 770 | type: "uniform" 771 | min: -0.08 772 | max: 0.08 773 | } 774 | bias_filler { 775 | type: "constant" 776 | value: 0 777 | } 778 | axis: 2 779 | } 780 | } 781 | layer { 782 | name: "predict" 783 | type: "InnerProduct" 784 | bottom: "lstm2" 785 | top: "predict" 786 | param { 787 | lr_mult: 1 788 | decay_mult: 1 789 | } 790 | param { 791 | lr_mult: 2 792 | decay_mult: 0 793 | } 794 | include { stage: "2-layer" } 795 | inner_product_param { 796 | num_output: 8801 797 | weight_filler { 798 | type: "uniform" 799 | min: -0.08 800 | max: 0.08 801 | } 802 | bias_filler { 803 | type: "constant" 804 | value: 0 805 | } 806 | axis: 2 807 | } 808 | } 809 | layer { 810 | name: "cross_entropy_loss" 811 | type: "SoftmaxWithLoss" 812 | bottom: "predict" 813 | bottom: "target_sentence" 814 | top: "cross_entropy_loss" 815 | loss_weight: 20 816 | loss_param { 817 | ignore_label: -1 818 | } 819 | softmax_param { 820 | axis: 2 821 | } 822 | } 823 | layer { 824 | name: "accuracy" 825 | type: "Accuracy" 826 | bottom: "predict" 827 | bottom: "target_sentence" 828 | top: "accuracy" 829 | include { phase: TEST } 830 | accuracy_param { 831 | axis: 2 832 | ignore_label: -1 833 | } 834 | } 835 | -------------------------------------------------------------------------------- /prototxt/scrc_no_context_vgg_solver.prototxt: -------------------------------------------------------------------------------- 1 | net: "./prototxt/scrc_no_context_vgg_buffer_50.prototxt" 2 | 3 | train_state: { stage: 'freeze-convnet' stage: 'factored' stage: '2-layer' } 4 | base_lr: 0.001 5 | lr_policy: "step" 6 | gamma: 0.5 7 | stepsize: 30000 8 | display: 1 9 | max_iter: 90000 10 | momentum: 0.9 11 | weight_decay: 0.0000 12 | snapshot: 10000 13 | snapshot_prefix: "./exp-referit/caffemodel/scrc_no_context_vgg" 14 | solver_mode: GPU 15 | random_seed: 1701 16 | average_loss: 100 17 | clip_gradients: 10 18 | -------------------------------------------------------------------------------- /prototxt/scrc_word_to_preds_full.prototxt: -------------------------------------------------------------------------------- 1 | input: "cont_sentence" 2 | input_shape { dim: 20 dim: 100 } 3 
| 4 | input: "input_sentence" 5 | input_shape { dim: 20 dim: 100 } 6 | 7 | input: "image_features" 8 | input_shape { dim: 100 dim: 1008 } 9 | 10 | input: "fc7_context" 11 | input_shape { dim: 100 dim: 4096 } 12 | 13 | layer { 14 | name: "embedding" 15 | type: "Embed" 16 | bottom: "input_sentence" 17 | top: "embedded_input_sentence" 18 | embed_param { 19 | input_dim: 8801 20 | num_output: 1000 21 | bias_term: false 22 | } 23 | } 24 | layer { 25 | name: "lstm1" 26 | type: "LSTM" 27 | bottom: "embedded_input_sentence" 28 | bottom: "cont_sentence" 29 | top: "lstm1" 30 | recurrent_param { num_output: 1000 } 31 | } 32 | layer { 33 | name: "lstm2-extended" 34 | type: "LSTM" 35 | bottom: "lstm1" 36 | bottom: "cont_sentence" 37 | bottom: "image_features" 38 | top: "lstm2" 39 | recurrent_param { num_output: 1000 } 40 | } 41 | layer { 42 | name: "predict" 43 | type: "InnerProduct" 44 | bottom: "lstm2" 45 | top: "predict" 46 | inner_product_param { 47 | axis: 2 48 | num_output: 8801 49 | } 50 | } 51 | 52 | # Context LSTM 53 | layer { 54 | name: "fc8_context" 55 | type: "InnerProduct" 56 | bottom: "fc7_context" 57 | top: "fc8_context" 58 | inner_product_param { num_output: 1000 } 59 | } 60 | layer { 61 | name: "lstm2_context" 62 | type: "LSTM" 63 | bottom: "lstm1" 64 | bottom: "cont_sentence" 65 | bottom: "fc8_context" 66 | top: "lstm2_context" 67 | recurrent_param { num_output: 1000 } 68 | } 69 | layer { 70 | name: "predict_context" 71 | type: "InnerProduct" 72 | bottom: "lstm2_context" 73 | top: "predict_context" 74 | inner_product_param { 75 | num_output: 8801 76 | axis: 2 77 | } 78 | } 79 | layer { 80 | name: "predict_combined" 81 | type: "Eltwise" 82 | bottom: "predict" 83 | bottom: "predict_context" 84 | top: "predict_combined" 85 | eltwise_param { operation: SUM } 86 | } 87 | layer { 88 | name: "probs" 89 | type: "Softmax" 90 | bottom: "predict_combined" 91 | top: "probs" 92 | softmax_param { axis: 2 } 93 | } 94 | -------------------------------------------------------------------------------- /prototxt/scrc_word_to_preds_no_context.prototxt: -------------------------------------------------------------------------------- 1 | input: "cont_sentence" 2 | input_shape { dim: 20 dim: 100 } 3 | 4 | input: "input_sentence" 5 | input_shape { dim: 20 dim: 100 } 6 | 7 | input: "image_features" 8 | input_shape { dim: 100 dim: 1008 } 9 | 10 | layer { 11 | name: "embedding" 12 | type: "Embed" 13 | bottom: "input_sentence" 14 | top: "embedded_input_sentence" 15 | embed_param { 16 | input_dim: 8801 17 | num_output: 1000 18 | bias_term: false 19 | } 20 | } 21 | layer { 22 | name: "lstm1" 23 | type: "LSTM" 24 | bottom: "embedded_input_sentence" 25 | bottom: "cont_sentence" 26 | top: "lstm1" 27 | recurrent_param { num_output: 1000 } 28 | } 29 | layer { 30 | name: "lstm2-extended" 31 | type: "LSTM" 32 | bottom: "lstm1" 33 | bottom: "cont_sentence" 34 | bottom: "image_features" 35 | top: "lstm2" 36 | recurrent_param { num_output: 1000 } 37 | } 38 | layer { 39 | name: "predict" 40 | type: "InnerProduct" 41 | bottom: "lstm2" 42 | top: "predict" 43 | inner_product_param { 44 | axis: 2 45 | num_output: 8801 46 | } 47 | } 48 | layer { 49 | name: "probs" 50 | type: "Softmax" 51 | bottom: "predict" 52 | top: "probs" 53 | softmax_param { axis: 2 } 54 | } 55 | -------------------------------------------------------------------------------- /prototxt/scrc_word_to_preds_no_spatial_no_context.prototxt: -------------------------------------------------------------------------------- 1 | input: "cont_sentence" 2 | input_shape { 
dim: 20 dim: 1000 } 3 | 4 | input: "input_sentence" 5 | input_shape { dim: 20 dim: 1000 } 6 | 7 | input: "image_features" 8 | input_shape { dim: 1000 dim: 1000 } 9 | 10 | layer { 11 | name: "embedding" 12 | type: "Embed" 13 | bottom: "input_sentence" 14 | top: "embedded_input_sentence" 15 | embed_param { 16 | input_dim: 8801 17 | num_output: 1000 18 | bias_term: false 19 | } 20 | } 21 | layer { 22 | name: "lstm1" 23 | type: "LSTM" 24 | bottom: "embedded_input_sentence" 25 | bottom: "cont_sentence" 26 | top: "lstm1" 27 | recurrent_param { num_output: 1000 } 28 | } 29 | layer { 30 | name: "lstm2" 31 | type: "LSTM" 32 | bottom: "lstm1" 33 | bottom: "cont_sentence" 34 | bottom: "image_features" 35 | top: "lstm2" 36 | recurrent_param { num_output: 1000 } 37 | } 38 | layer { 39 | name: "predict" 40 | type: "InnerProduct" 41 | bottom: "lstm2" 42 | top: "predict" 43 | inner_product_param { 44 | axis: 2 45 | num_output: 8801 46 | } 47 | } 48 | layer { 49 | name: "probs" 50 | type: "Softmax" 51 | bottom: "predict" 52 | top: "probs" 53 | softmax_param { axis: 2 } 54 | } 55 | -------------------------------------------------------------------------------- /retriever.py: -------------------------------------------------------------------------------- 1 | from __future__ import division, print_function 2 | 3 | import os 4 | import re 5 | import numpy as np 6 | import h5py 7 | import skimage.io 8 | 9 | # Compute vocabulary indices from sentence 10 | MAX_WORDS = 20 11 | UNK_IDENTIFIER = '<unk>' # <unk> is the word used to identify unknown words 12 | SENTENCE_SPLIT_REGEX = re.compile(r'(\W+)') 13 | def sentence2vocab_indices(raw_sentence, vocab_dict): 14 | splits = SENTENCE_SPLIT_REGEX.split(raw_sentence.strip()) 15 | sentence = [ s.lower() for s in splits if len(s.strip()) > 0 ] 16 | # remove trailing '.' 
17 | if sentence[-1] == '.': 18 | sentence = sentence[:-1] 19 | vocab_indices = [ (vocab_dict[s] if s in vocab_dict else vocab_dict[UNK_IDENTIFIER]) 20 | for s in sentence ] 21 | if len(vocab_indices) > MAX_WORDS: 22 | vocab_indices = vocab_indices[:MAX_WORDS] 23 | return vocab_indices 24 | 25 | # Build vocabulary dictionary from file 26 | def build_vocab_dict_from_file(vocab_file): 27 | vocab = ['<EOS>'] 28 | with open(vocab_file, 'r') as f: 29 | lines = f.readlines() 30 | vocab += [ word.strip() for word in lines ] 31 | vocab_dict = { vocab[n] : n for n in range(len(vocab)) } 32 | return vocab_dict 33 | 34 | # Build vocabulary dictionary from captioner 35 | def build_vocab_dict_from_captioner(captioner): 36 | vocab_dict = {captioner.vocab[n] : n for n in range(len(captioner.vocab))} 37 | return vocab_dict 38 | 39 | def score_descriptors(descriptors, raw_sentence, captioner, vocab_dict): 40 | vocab_indices = sentence2vocab_indices(raw_sentence, vocab_dict) 41 | num_descriptors = descriptors.shape[0] 42 | scores = np.zeros(num_descriptors) 43 | 44 | net = captioner.lstm_net 45 | 46 | T = len(vocab_indices) 47 | N = descriptors.shape[0] 48 | # reshape only when necessary 49 | if list(net.blobs['cont_sentence'].shape) != [MAX_WORDS, N]: 50 | net.blobs['cont_sentence'].reshape(MAX_WORDS, N) 51 | net.blobs['input_sentence'].reshape(MAX_WORDS, N) 52 | net.blobs['image_features'].reshape(N, *net.blobs['image_features'].data.shape[1:]) 53 | # print('LSTM net reshape to ' + str([MAX_WORDS, N])) 54 | 55 | cont_sentence = np.array([0] + [1 for v in vocab_indices[:-1] ]).reshape((-1, 1)) 56 | input_sentence = np.array([0] + vocab_indices[:-1] ).reshape((-1, 1)) 57 | 58 | net.blobs['cont_sentence'].data[:T, :] = cont_sentence 59 | net.blobs['input_sentence'].data[:T, :] = input_sentence 60 | net.blobs['image_features'].data[...] = descriptors 61 | net.forward() 62 | 63 | probs = net.blobs['probs'].data[:T, :, :] 64 | for t in range(T): 65 | scores += np.log(probs[t, :, vocab_indices[t] ]) 66 | return scores 67 | 68 | def score_descriptors_context(descriptors, raw_sentence, fc7_context, captioner, vocab_dict): 69 | vocab_indices = sentence2vocab_indices(raw_sentence, vocab_dict) 70 | num_descriptors = descriptors.shape[0] 71 | scores = np.zeros(num_descriptors) 72 | 73 | net = captioner.lstm_net 74 | 75 | T = len(vocab_indices) 76 | N = descriptors.shape[0] 77 | # reshape only when necessary 78 | if list(net.blobs['cont_sentence'].shape) != [MAX_WORDS, N]: 79 | net.blobs['cont_sentence'].reshape(MAX_WORDS, N) 80 | net.blobs['input_sentence'].reshape(MAX_WORDS, N) 81 | net.blobs['image_features'].reshape(N, *net.blobs['image_features'].data.shape[1:]) 82 | net.blobs['fc7_context'].reshape(N, *net.blobs['fc7_context'].data.shape[1:]) 83 | # print('LSTM net reshape to ' + str([MAX_WORDS, N])) 84 | 85 | cont_sentence = np.array([0] + [1 for v in vocab_indices[:-1] ]).reshape((-1, 1)) 86 | input_sentence = np.array([0] + vocab_indices[:-1] ).reshape((-1, 1)) 87 | 88 | net.blobs['cont_sentence'].data[:T, :] = cont_sentence 89 | net.blobs['input_sentence'].data[:T, :] = input_sentence 90 | net.blobs['image_features'].data[...] = descriptors 91 | net.blobs['fc7_context'].data[...] 
= fc7_context 92 | net.forward() 93 | 94 | probs = net.blobs['probs'].data[:T, :, :] 95 | for t in range(T): 96 | scores += np.log(probs[t, :, vocab_indices[t] ]) 97 | return scores 98 | 99 | 100 | # all boxes are [xmin, ymin, xmax, ymax] format, 0-indexed, including xmax and ymax 101 | def compute_iou(boxes, target): 102 | assert(target.ndim == 1 and boxes.ndim == 2) 103 | A_boxes = (boxes[:, 2] - boxes[:, 0] + 1) * (boxes[:, 3] - boxes[:, 1] + 1) 104 | A_target = (target[2] - target[0] + 1) * (target[3] - target[1] + 1) 105 | assert(np.all(A_boxes >= 0)) 106 | assert(np.all(A_target >= 0)) 107 | I_x1 = np.maximum(boxes[:, 0], target[0]) 108 | I_y1 = np.maximum(boxes[:, 1], target[1]) 109 | I_x2 = np.minimum(boxes[:, 2], target[2]) 110 | I_y2 = np.minimum(boxes[:, 3], target[3]) 111 | A_I = np.maximum(I_x2 - I_x1 + 1, 0) * np.maximum(I_y2 - I_y1 + 1, 0) 112 | IoUs = A_I / (A_boxes + A_target - A_I) 113 | assert(np.all(0 <= IoUs) and np.all(IoUs <= 1)) 114 | return IoUs 115 | 116 | def crop_edge_boxes(image, edge_boxes): 117 | # load the image if a file path is given 118 | if type(image) in (str, unicode): 119 | image = skimage.io.imread(image) 120 | if image.dtype == np.float32: 121 | image *= 255 122 | image = image.astype(np.uint8) 123 | # Gray scale to RGB 124 | if image.ndim == 2: 125 | image = np.tile(image[..., np.newaxis], (1, 1, 3)) 126 | # RGBA to RGB 127 | image = image[:, :, :3] 128 | x1, y1, x2, y2 = edge_boxes[:, 0], edge_boxes[:, 1], edge_boxes[:, 2], edge_boxes[:, 3] 129 | crops = [image[y1[n]:y2[n]+1, x1[n]:x2[n]+1, :] for n in range(edge_boxes.shape[0])] 130 | return crops 131 | 132 | def compute_descriptors_edgebox(captioner, image, edge_boxes, output_name='fc8'): 133 | crops = crop_edge_boxes(image, edge_boxes) 134 | return compute_descriptors(captioner, crops, output_name) 135 | 136 | def preprocess_image(captioner, image, verbose=False): 137 | if type(image) in (str, unicode): 138 | image = skimage.io.imread(image) 139 | if image.dtype == np.float32: 140 | image *= 255 141 | image = image.astype(np.uint8) 142 | # Gray scale to RGB 143 | if image.ndim == 2: 144 | image = np.tile(image[..., np.newaxis], (1, 1, 3)) 145 | # RGBA to RGB 146 | image = image[:, :, :3] 147 | preprocessed_image = captioner.transformer.preprocess('data', image) 148 | return preprocessed_image 149 | 150 | def compute_descriptors(captioner, image_list, output_name='fc8'): 151 | batch = np.zeros_like(captioner.image_net.blobs['data'].data) 152 | batch_shape = batch.shape 153 | batch_size = batch_shape[0] 154 | descriptors_shape = (len(image_list), ) + \ 155 | captioner.image_net.blobs[output_name].data.shape[1:] 156 | descriptors = np.zeros(descriptors_shape) 157 | for batch_start_index in range(0, len(image_list), batch_size): 158 | batch_list = image_list[batch_start_index:(batch_start_index + batch_size)] 159 | for batch_index, image_path in enumerate(batch_list): 160 | batch[batch_index:(batch_index + 1)] = preprocess_image(captioner, image_path) 161 | current_batch_size = min(batch_size, len(image_list) - batch_start_index) 162 | captioner.image_net.forward(data=batch) 163 | descriptors[batch_start_index:(batch_start_index + current_batch_size)] = \ 164 | captioner.image_net.blobs[output_name].data[:current_batch_size] 165 | return descriptors 166 | 167 | # normalize bounding box coordinates into an 8-D spatial feature 168 | def compute_spatial_feat(bboxes, image_size): 169 | if bboxes.ndim == 1: 170 | bboxes = bboxes.reshape((1, 4)) 171 | im_w = image_size[0] 172 | im_h = image_size[1] 173 | assert(np.all(bboxes[:, 0] < im_w) and 
np.all(bboxes[:, 2] < im_w)) 174 | assert(np.all(bboxes[:, 1] < im_h) and np.all(bboxes[:, 3] < im_h)) 175 | 176 | feats = np.zeros((bboxes.shape[0], 8)) 177 | feats[:, 0] = bboxes[:, 0] * 2.0 / im_w - 1 # x1 178 | feats[:, 1] = bboxes[:, 1] * 2.0 / im_h - 1 # y1 179 | feats[:, 2] = bboxes[:, 2] * 2.0 / im_w - 1 # x2 180 | feats[:, 3] = bboxes[:, 3] * 2.0 / im_h - 1 # y2 181 | feats[:, 4] = (feats[:, 0] + feats[:, 2]) / 2 # x0 182 | feats[:, 5] = (feats[:, 1] + feats[:, 3]) / 2 # y0 183 | feats[:, 6] = feats[:, 2] - feats[:, 0] # w 184 | feats[:, 7] = feats[:, 3] - feats[:, 1] # h 185 | return feats 186 | 187 | # Write a batch of sentences to HDF5 188 | def write_batch_to_hdf5(filename, cont_sentences, input_sentences, 189 | target_sentences, dtype=np.float32): 190 | h5file = h5py.File(filename, 'w') 191 | dataset = h5file.create_dataset('cont_sentence', 192 | shape=cont_sentences.shape, dtype=dtype) 193 | dataset[:] = cont_sentences 194 | dataset = h5file.create_dataset('input_sentence', 195 | shape=input_sentences.shape, dtype=dtype) 196 | dataset[:] = input_sentences 197 | dataset = h5file.create_dataset('target_sentence', 198 | shape=target_sentences.shape, dtype=dtype) 199 | dataset[:] = target_sentences 200 | h5file.close() 201 | 202 | # Write a batch of bounding box coordinates to HDF5 203 | def write_bbox_to_hdf5(filename, bbox_coordinates, dtype=np.float32): 204 | h5file = h5py.File(filename, 'w') 205 | dataset = h5file.create_dataset('bbox_coordinate', 206 | shape=bbox_coordinates.shape, dtype=dtype) 207 | dataset[:] = bbox_coordinates 208 | h5file.close() 209 | 210 | # Write a batch of bounding box coordinates and context features to HDF5 211 | def write_bbox_context_to_hdf5(filename, bbox_coordinates, fc7_context, dtype=np.float32): 212 | h5file = h5py.File(filename, 'w') 213 | dataset = h5file.create_dataset('bbox_coordinate', 214 | shape=bbox_coordinates.shape, dtype=dtype) 215 | dataset[:] = bbox_coordinates 216 | dataset = h5file.create_dataset('fc7_context', 217 | shape=fc7_context.shape, dtype=dtype) 218 | dataset[:] = fc7_context 219 | h5file.close() 220 | -------------------------------------------------------------------------------- /util/__init__.py: -------------------------------------------------------------------------------- 1 | from . import io 2 | -------------------------------------------------------------------------------- /util/io.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | def load_str_list(filename): 4 | with open(filename, 'r') as f: 5 | str_list = f.readlines() 6 | str_list = [s[:-1] for s in str_list] 7 | return str_list 8 | 9 | def save_str_list(str_list, filename): 10 | str_list = [s+'\n' for s in str_list] 11 | with open(filename, 'w') as f: 12 | f.writelines(str_list) 13 | 14 | def load_json(filename): 15 | with open(filename, 'r') as f: 16 | return json.load(f) 17 | 18 | def save_json(json_obj, filename): 19 | with open(filename, 'w') as f: 20 | json.dump(json_obj, f, separators=(',\n', ':\n')) 21 | --------------------------------------------------------------------------------
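The text-handling utilities in `retriever.py` are plain NumPy/regex code and can be sanity-checked in isolation. Below is a minimal sketch of how `sentence2vocab_indices` processes a query: it lower-cases the tokens, strips a trailing period, and maps out-of-vocabulary words to `<unk>`. The five-entry `vocab_dict` here is hypothetical and for illustration only; in the real pipeline it comes from `build_vocab_dict_from_file`, which reserves index 0 for `<EOS>` and reads the remaining words one per line from the vocabulary file (e.g. `data/vocabulary.txt`).

```
# Toy vocabulary for illustration only; the real one comes from
# build_vocab_dict_from_file on the repository's vocabulary file.
from retriever import sentence2vocab_indices

vocab_dict = {'<EOS>': 0, '<unk>': 1, 'the': 2, 'white': 3, 'car': 4}

print(sentence2vocab_indices('The white car.', vocab_dict))  # [2, 3, 4]
print(sentence2vocab_indices('The shiny car.', vocab_dict))  # [2, 1, 4] ('shiny' -> <unk>)
```

`score_descriptors` then feeds these indices into the LSTM shifted right by one step (index 0, the `<EOS>` token, serves as the start marker) and sums the per-timestep log-probabilities, so each candidate box is scored by the log-likelihood of the query sentence given that box's features.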
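The box utilities follow the convention noted in the comment above `compute_iou`: boxes are `[xmin, ymin, xmax, ymax]`, 0-indexed, with inclusive corners, and `compute_spatial_feat` takes `image_size` as `(width, height)`. A quick check with made-up boxes, chosen only to illustrate these conventions:

```
import numpy as np
from retriever import compute_iou, compute_spatial_feat

boxes = np.array([[ 0,  0,  99,  99],   # 100x100 box at the origin
                  [50, 50, 149, 149]])  # same size, shifted by (50, 50)
target = np.array([0, 0, 99, 99])

# First box matches exactly (IoU 1.0); the second overlaps in a 50x50
# region, giving IoU 2500 / (10000 + 10000 - 2500) = 1/7.
print(compute_iou(boxes, target))

# 8-D feature for a 200x100 (width x height) image: the four corners
# scaled to [-1, 1], plus the box center, width, and height.
print(compute_spatial_feat(target, (200, 100)))
```

This 8-D spatial feature is what gets concatenated with the 1000-D `fc8` output to form the 1008-D `image_features` input declared in the deploy-time `scrc_word_to_preds_*` prototxt files above.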