├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── Summary_Template.md ├── images ├── ADF-IS_ROC.png ├── ADF-IS_archi.png ├── ADF-IS_attack .png ├── AI_Control1.png ├── AI_Control2.png ├── Attention_Calculation.png ├── BYOL.png ├── BYOL_graph.png ├── BYOL_loss.png ├── CLIP-results.png ├── CLIP_working.png ├── Calibration-affecting-factors.png ├── Calibration-comparison-LeNet_Resnet.png ├── Calibration-results.png ├── Calibration-temp-scaling.png ├── Cicero_architecture.jpg ├── Cicero_results.jpg ├── Conclusion.png ├── DIRE.png ├── DoodlerGAN architecture.png ├── DyT_BN.png ├── DyT_DiT.png ├── DyT_LLaMa.png ├── DyT_ViT.png ├── DyT_efficiency.png ├── DyT_working.png ├── GANcraft_fid_scores.PNG ├── GANcraft_model.PNG ├── GANcraft_overview.PNG ├── GANcraft_results.gif ├── GIRAFFE_model.PNG ├── GIRAFFE_overview.PNG ├── GIRAFFE_results.gif ├── GameGAN.png ├── GameGAN_LSTM.png ├── GameGAN_cycle_loss.png ├── GameGAN_memory_eqn.png ├── Garfield1.jpg ├── Garfield2.png ├── Garfield3.jpg ├── Garfield4.jpg ├── Garfield5.jpg ├── GroupDNet.png ├── Image_Hijack1.jpg ├── Image_Hijack2.jpg ├── LayerNorm_results.png ├── Multihead_attention.png ├── NMT_img1.png ├── NMT_img2.png ├── Output images.png ├── PauseTokensignoreoutput.png ├── Pause_loss_function.png ├── Pause_results.png ├── SAM_architecture.png ├── SAM_dataengine.png ├── SAM_decoder.png ├── SISA_Learning.png ├── SMIS_contributions.png ├── SMIS_task.png ├── SiamMae-Archi.png ├── SiamMae-Results.png ├── SiamMae_comparision.png ├── The-Transformer-model-architecture.png ├── Tune_a_video1.png ├── Tune_a_video2.png ├── Tune_a_video3.png ├── Tune_a_video4.png ├── Tune_a_video5.png ├── URIAL_Distribution_shift.png ├── URIAL_Graph.png ├── URIAL_Main_method.png ├── URIAL_Results.png ├── VQAScore_Architechture.png ├── VQAScore_Attributional.png ├── VQAScore_GenAI-bench.png ├── VQAScore_Results1.png ├── VQAscore_examples.png ├── ViLBERT_Model.png ├── ViLBERT_TRM_Co-TRM.png ├── ViLBERT_pretraining_tasks.png ├── ViT_architecture.png ├── WARM_Weight_avg.png ├── WARM_formula.png ├── WARM_procedure.png ├── WARM_results1.png ├── WARM_results2.png ├── What_do_neural_networks_learn_in_image_classification-A_frequency_shortcut_perspective_ADCS.png ├── What_do_neural_networks_learn_in_image_classification-A_frequency_shortcut_perspective_DFMs.png ├── What_do_neural_networks_learn_in_image_classification-A_frequency_shortcut_perspective_Eqns.png ├── What_do_neural_networks_learn_in_image_classification-A_frequency_shortcut_perspective_SynDatasets.png ├── adv_RL_1.png ├── baboon.png ├── binaryTTC.png ├── control_tasks.png ├── cycada.png ├── cyclegan.png ├── densenet.png ├── drawbech_2.png ├── ecpe.png ├── entity_linking.png ├── feature_denoising.png ├── florence.png ├── graph_unet.png ├── groknet_process.png ├── groknet_table.png ├── ical1.jpg ├── ical2.jpg ├── imagen_create.png ├── info_bottleneck_loss_1.png ├── info_bottleneck_loss_2.png ├── infomax.png ├── information_bottleneck.png ├── instructscene.png ├── instructscene_results_1.png ├── instructscene_results_2.png ├── knowledge_distillation_ECE_NLL.png ├── knowledge_distillation_calibration_score.png ├── knowledge_distillation_model.png ├── lavila_ego4d.gif ├── lavila_ft.jpeg ├── lavila_narrator.png ├── lavila_zs.jpeg ├── prototype.png ├── self-attention-output.png ├── singan.png ├── spectrum.png ├── srn.png ├── srn_2.png ├── textual_inversion 1.png ├── textual_inversion 2.png ├── textual_inversion 3.png ├── textual_inversion 4.png ├── this_looks_like_that.png ├── transformer_performance.png ├── 
uniform_convergence.png ├── unsupervised_deformable_3D.png ├── unsupervised_deformable_3D_loss.png ├── unsupervised_deformable_3D_perceptual_loss.png ├── unsupervised_deformable_3D_summary.png ├── vgraph.png ├── visual_attention.png ├── w2v-bert.png └── yoto_method.png └── summaries ├── 3d_object_detection.md ├── AI_Control.md ├── Ablating Concepts in Text-to-Image Diffusion Models.md ├── Adversarial_RL.md ├── Attention_Is_All_You_Need.md ├── BART_Denoising_Sequence_to_Sequence_Pretraining_for_Natural_Language_Generation_Translation_and_Comprehension.md ├── BYOL.md ├── Big_Bird_Transformers.md ├── CICERO.md ├── CLIP.md ├── Calibration_of_Neural_Nets.md ├── DIRE.md ├── DoodlerGAN summary.md ├── DreamBooth.md ├── FFA-Net.md ├── GANcraft.md ├── GARField.md ├── GIRAFFE.md ├── GameGAN.md ├── GrokNet.md ├── ICAL.md ├── Image_Hijacks.md ├── Image_Steganography_ADF-IS.md ├── Matryoshka_Diffusion.md ├── Multi Concept_Customization_of_Text_to_Image_Diffusion.md ├── NMT_Gap.md ├── Pause_Tokens.md ├── RLHF_Diversity.md ├── RainbowMemory.md ├── Segment_Anything.md ├── Semantically_multi-modal_image_synthesis.md ├── SiamMAE.md ├── Textual_inversion.md ├── Transformers_without_Normalization.md ├── Tune_A_Video.md ├── URIAL.md ├── Universal_Transferable_Adversial_Attacks.md ├── Unsupervised_learning_for_3D_objects_from_images.md ├── VQAscore.md ├── ViLBERT.md ├── Video_Representation_LLM.md ├── Vision_Transformer.md ├── W2V-BERT.md ├── WARM.md ├── What_do_neural_networks_learn_in_image_classification-A_frequency_shortcut_perspective.md ├── You_only_train_once.md ├── attention.md ├── binary_TTC.md ├── control_tasks.md ├── cycada.md ├── cycle_dehaze.md ├── cyclegan.md ├── densenet.md ├── ecpe.md ├── entity_linking.md ├── feature_denoising.md ├── florence.md ├── frequency_bias.md ├── graph_unet.md ├── imagen.md ├── info_bottleneck.md ├── infomax.md ├── instructscene.md ├── knowledge_distillation_with_model_calibration.md ├── machine_unlearning.md ├── siamese.md ├── singan.md ├── srn.md ├── style_GAN.md ├── this_looks_like_that.md ├── uniform_convergence.md ├── vgraph.md └── vision_attention.md /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing to papers_we_read 2 | 3 | All kinds of paper summaries are welcome. 4 | 5 | ## How to Open a Pull Request (PR) 6 | 7 | 1. **Fork** and **Pull** the latest `papers_we_read`. 8 | 2. **Checkout** to a new branch with the name `<paper_name>`. (DO NOT use the master branch for PRs!) 9 | 3. **Commit** your summaries on the new branch of your forked repo: 10 | - Please make a single commit for ease of reviewing before opening a PR, with the commit message: `:zap: Add Summary for <paper_name>`. 11 | 4. Create a PR with the title: `:zap: Add Summary for <paper_name>`. 12 | 13 | ## Summary Guidelines 14 | 15 | 1. All summaries must be written in markdown format, following the [Summary Template](Summary_Template.md). 16 | 2. The `.md` file should be named `<paper_name>.md` and must be placed under the [summaries](summaries/) folder. 17 | 3. Put any images, animations, etc. that you use in the markdown files under the [images](images/) folder. 18 | 19 | ## PR Review 20 | 21 | The PRs will be reviewed and merged by the core members of the [Vision and Language Group IITR](https://vlgiitr.github.io/). 
22 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 Vision & Language Group, IIT Roorkee 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /Summary_Template.md: -------------------------------------------------------------------------------- 1 | # PAPER_TITLE 2 | 3 | Author 1, Author 2, ..., **<CONFERENCE>** **<YEAR>** 4 | 5 | ## Summary 6 | 7 | Include a brief description of the paper, from an intuitive point of view if possible. 8 | 9 | ## Contributions 10 | 11 | Major contributions of the paper in bullet points. 12 | 13 | ## Method 14 | 15 | - A comprehensive description of the methodology proposed in the paper. 16 | - You may include any images for better presentation. 17 | - Mathematical equations may also be included for better understanding. 18 | 19 | ## Results 20 | 21 | - Comments on the results in the paper. 22 | - Comparisons with the baselines and existing SOTA. 23 | 24 | ## Two-Cents 25 | 26 | Your personal opinion about the paper, including appreciation, criticism, and possible future directions for research. 27 | 28 | ## Resources 29 | 30 | Any links to the project page, YouTube video, implementation, blog, etc. for the paper. 
31 | -------------------------------------------------------------------------------- /images/ADF-IS_ROC.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/ADF-IS_ROC.png -------------------------------------------------------------------------------- /images/ADF-IS_archi.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/ADF-IS_archi.png -------------------------------------------------------------------------------- /images/ADF-IS_attack .png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/ADF-IS_attack .png -------------------------------------------------------------------------------- /images/AI_Control1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/AI_Control1.png -------------------------------------------------------------------------------- /images/AI_Control2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/AI_Control2.png -------------------------------------------------------------------------------- /images/Attention_Calculation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/Attention_Calculation.png -------------------------------------------------------------------------------- /images/BYOL.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/BYOL.png -------------------------------------------------------------------------------- /images/BYOL_graph.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/BYOL_graph.png -------------------------------------------------------------------------------- /images/BYOL_loss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/BYOL_loss.png -------------------------------------------------------------------------------- /images/CLIP-results.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/CLIP-results.png -------------------------------------------------------------------------------- /images/CLIP_working.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/CLIP_working.png -------------------------------------------------------------------------------- /images/Calibration-affecting-factors.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/Calibration-affecting-factors.png -------------------------------------------------------------------------------- /images/Calibration-comparison-LeNet_Resnet.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/Calibration-comparison-LeNet_Resnet.png -------------------------------------------------------------------------------- /images/Calibration-results.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/Calibration-results.png -------------------------------------------------------------------------------- /images/Calibration-temp-scaling.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/Calibration-temp-scaling.png -------------------------------------------------------------------------------- /images/Cicero_architecture.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/Cicero_architecture.jpg -------------------------------------------------------------------------------- /images/Cicero_results.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/Cicero_results.jpg -------------------------------------------------------------------------------- /images/Conclusion.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/Conclusion.png -------------------------------------------------------------------------------- /images/DIRE.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/DIRE.png -------------------------------------------------------------------------------- /images/DoodlerGAN architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/DoodlerGAN architecture.png -------------------------------------------------------------------------------- /images/DyT_BN.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/DyT_BN.png -------------------------------------------------------------------------------- /images/DyT_DiT.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/DyT_DiT.png -------------------------------------------------------------------------------- /images/DyT_LLaMa.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/DyT_LLaMa.png -------------------------------------------------------------------------------- /images/DyT_ViT.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/DyT_ViT.png -------------------------------------------------------------------------------- /images/DyT_efficiency.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/DyT_efficiency.png -------------------------------------------------------------------------------- /images/DyT_working.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/DyT_working.png -------------------------------------------------------------------------------- /images/GANcraft_fid_scores.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/GANcraft_fid_scores.PNG -------------------------------------------------------------------------------- /images/GANcraft_model.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/GANcraft_model.PNG -------------------------------------------------------------------------------- /images/GANcraft_overview.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/GANcraft_overview.PNG -------------------------------------------------------------------------------- /images/GANcraft_results.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/GANcraft_results.gif -------------------------------------------------------------------------------- /images/GIRAFFE_model.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/GIRAFFE_model.PNG -------------------------------------------------------------------------------- /images/GIRAFFE_overview.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/GIRAFFE_overview.PNG -------------------------------------------------------------------------------- /images/GIRAFFE_results.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/GIRAFFE_results.gif -------------------------------------------------------------------------------- /images/GameGAN.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/GameGAN.png -------------------------------------------------------------------------------- /images/GameGAN_LSTM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/GameGAN_LSTM.png -------------------------------------------------------------------------------- /images/GameGAN_cycle_loss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/GameGAN_cycle_loss.png -------------------------------------------------------------------------------- /images/GameGAN_memory_eqn.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/GameGAN_memory_eqn.png -------------------------------------------------------------------------------- /images/Garfield1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/Garfield1.jpg -------------------------------------------------------------------------------- /images/Garfield2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/Garfield2.png -------------------------------------------------------------------------------- /images/Garfield3.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/Garfield3.jpg -------------------------------------------------------------------------------- /images/Garfield4.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/Garfield4.jpg -------------------------------------------------------------------------------- /images/Garfield5.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/Garfield5.jpg -------------------------------------------------------------------------------- /images/GroupDNet.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/GroupDNet.png -------------------------------------------------------------------------------- /images/Image_Hijack1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/Image_Hijack1.jpg -------------------------------------------------------------------------------- /images/Image_Hijack2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/Image_Hijack2.jpg 
-------------------------------------------------------------------------------- /images/LayerNorm_results.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/LayerNorm_results.png -------------------------------------------------------------------------------- /images/Multihead_attention.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/Multihead_attention.png -------------------------------------------------------------------------------- /images/NMT_img1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/NMT_img1.png -------------------------------------------------------------------------------- /images/NMT_img2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/NMT_img2.png -------------------------------------------------------------------------------- /images/Output images.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/Output images.png -------------------------------------------------------------------------------- /images/PauseTokensignoreoutput.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/PauseTokensignoreoutput.png -------------------------------------------------------------------------------- /images/Pause_loss_function.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/Pause_loss_function.png -------------------------------------------------------------------------------- /images/Pause_results.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/Pause_results.png -------------------------------------------------------------------------------- /images/SAM_architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/SAM_architecture.png -------------------------------------------------------------------------------- /images/SAM_dataengine.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/SAM_dataengine.png -------------------------------------------------------------------------------- /images/SAM_decoder.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/SAM_decoder.png -------------------------------------------------------------------------------- 
/images/SISA_Learning.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/SISA_Learning.png -------------------------------------------------------------------------------- /images/SMIS_contributions.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/SMIS_contributions.png -------------------------------------------------------------------------------- /images/SMIS_task.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/SMIS_task.png -------------------------------------------------------------------------------- /images/SiamMae-Archi.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/SiamMae-Archi.png -------------------------------------------------------------------------------- /images/SiamMae-Results.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/SiamMae-Results.png -------------------------------------------------------------------------------- /images/SiamMae_comparision.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/SiamMae_comparision.png -------------------------------------------------------------------------------- /images/The-Transformer-model-architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/The-Transformer-model-architecture.png -------------------------------------------------------------------------------- /images/Tune_a_video1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/Tune_a_video1.png -------------------------------------------------------------------------------- /images/Tune_a_video2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/Tune_a_video2.png -------------------------------------------------------------------------------- /images/Tune_a_video3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/Tune_a_video3.png -------------------------------------------------------------------------------- /images/Tune_a_video4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/Tune_a_video4.png -------------------------------------------------------------------------------- /images/Tune_a_video5.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/Tune_a_video5.png -------------------------------------------------------------------------------- /images/URIAL_Distribution_shift.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/URIAL_Distribution_shift.png -------------------------------------------------------------------------------- /images/URIAL_Graph.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/URIAL_Graph.png -------------------------------------------------------------------------------- /images/URIAL_Main_method.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/URIAL_Main_method.png -------------------------------------------------------------------------------- /images/URIAL_Results.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/URIAL_Results.png -------------------------------------------------------------------------------- /images/VQAScore_Architechture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/VQAScore_Architechture.png -------------------------------------------------------------------------------- /images/VQAScore_Attributional.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/VQAScore_Attributional.png -------------------------------------------------------------------------------- /images/VQAScore_GenAI-bench.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/VQAScore_GenAI-bench.png -------------------------------------------------------------------------------- /images/VQAScore_Results1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/VQAScore_Results1.png -------------------------------------------------------------------------------- /images/VQAscore_examples.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/VQAscore_examples.png -------------------------------------------------------------------------------- /images/ViLBERT_Model.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/ViLBERT_Model.png -------------------------------------------------------------------------------- /images/ViLBERT_TRM_Co-TRM.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/ViLBERT_TRM_Co-TRM.png -------------------------------------------------------------------------------- /images/ViLBERT_pretraining_tasks.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/ViLBERT_pretraining_tasks.png -------------------------------------------------------------------------------- /images/ViT_architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/ViT_architecture.png -------------------------------------------------------------------------------- /images/WARM_Weight_avg.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/WARM_Weight_avg.png -------------------------------------------------------------------------------- /images/WARM_formula.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/WARM_formula.png -------------------------------------------------------------------------------- /images/WARM_procedure.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/WARM_procedure.png -------------------------------------------------------------------------------- /images/WARM_results1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/WARM_results1.png -------------------------------------------------------------------------------- /images/WARM_results2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/WARM_results2.png -------------------------------------------------------------------------------- /images/What_do_neural_networks_learn_in_image_classification-A_frequency_shortcut_perspective_ADCS.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/What_do_neural_networks_learn_in_image_classification-A_frequency_shortcut_perspective_ADCS.png -------------------------------------------------------------------------------- /images/What_do_neural_networks_learn_in_image_classification-A_frequency_shortcut_perspective_DFMs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/What_do_neural_networks_learn_in_image_classification-A_frequency_shortcut_perspective_DFMs.png -------------------------------------------------------------------------------- /images/What_do_neural_networks_learn_in_image_classification-A_frequency_shortcut_perspective_Eqns.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/What_do_neural_networks_learn_in_image_classification-A_frequency_shortcut_perspective_Eqns.png -------------------------------------------------------------------------------- /images/What_do_neural_networks_learn_in_image_classification-A_frequency_shortcut_perspective_SynDatasets.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/What_do_neural_networks_learn_in_image_classification-A_frequency_shortcut_perspective_SynDatasets.png -------------------------------------------------------------------------------- /images/adv_RL_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/adv_RL_1.png -------------------------------------------------------------------------------- /images/baboon.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/baboon.png -------------------------------------------------------------------------------- /images/binaryTTC.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/binaryTTC.png -------------------------------------------------------------------------------- /images/control_tasks.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/control_tasks.png -------------------------------------------------------------------------------- /images/cycada.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/cycada.png -------------------------------------------------------------------------------- /images/cyclegan.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/cyclegan.png -------------------------------------------------------------------------------- /images/densenet.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/densenet.png -------------------------------------------------------------------------------- /images/drawbech_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/drawbech_2.png -------------------------------------------------------------------------------- /images/ecpe.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/ecpe.png 
-------------------------------------------------------------------------------- /images/entity_linking.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/entity_linking.png -------------------------------------------------------------------------------- /images/feature_denoising.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/feature_denoising.png -------------------------------------------------------------------------------- /images/florence.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/florence.png -------------------------------------------------------------------------------- /images/graph_unet.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/graph_unet.png -------------------------------------------------------------------------------- /images/groknet_process.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/groknet_process.png -------------------------------------------------------------------------------- /images/groknet_table.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/groknet_table.png -------------------------------------------------------------------------------- /images/ical1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/ical1.jpg -------------------------------------------------------------------------------- /images/ical2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/ical2.jpg -------------------------------------------------------------------------------- /images/imagen_create.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/imagen_create.png -------------------------------------------------------------------------------- /images/info_bottleneck_loss_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/info_bottleneck_loss_1.png -------------------------------------------------------------------------------- /images/info_bottleneck_loss_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/info_bottleneck_loss_2.png -------------------------------------------------------------------------------- /images/infomax.png: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/infomax.png -------------------------------------------------------------------------------- /images/information_bottleneck.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/information_bottleneck.png -------------------------------------------------------------------------------- /images/instructscene.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/instructscene.png -------------------------------------------------------------------------------- /images/instructscene_results_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/instructscene_results_1.png -------------------------------------------------------------------------------- /images/instructscene_results_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/instructscene_results_2.png -------------------------------------------------------------------------------- /images/knowledge_distillation_ECE_NLL.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/knowledge_distillation_ECE_NLL.png -------------------------------------------------------------------------------- /images/knowledge_distillation_calibration_score.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/knowledge_distillation_calibration_score.png -------------------------------------------------------------------------------- /images/knowledge_distillation_model.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/knowledge_distillation_model.png -------------------------------------------------------------------------------- /images/lavila_ego4d.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/lavila_ego4d.gif -------------------------------------------------------------------------------- /images/lavila_ft.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/lavila_ft.jpeg -------------------------------------------------------------------------------- /images/lavila_narrator.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/lavila_narrator.png -------------------------------------------------------------------------------- /images/lavila_zs.jpeg: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/lavila_zs.jpeg -------------------------------------------------------------------------------- /images/prototype.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/prototype.png -------------------------------------------------------------------------------- /images/self-attention-output.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/self-attention-output.png -------------------------------------------------------------------------------- /images/singan.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/singan.png -------------------------------------------------------------------------------- /images/spectrum.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/spectrum.png -------------------------------------------------------------------------------- /images/srn.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/srn.png -------------------------------------------------------------------------------- /images/srn_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/srn_2.png -------------------------------------------------------------------------------- /images/textual_inversion 1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/textual_inversion 1.png -------------------------------------------------------------------------------- /images/textual_inversion 2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/textual_inversion 2.png -------------------------------------------------------------------------------- /images/textual_inversion 3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/textual_inversion 3.png -------------------------------------------------------------------------------- /images/textual_inversion 4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/textual_inversion 4.png -------------------------------------------------------------------------------- /images/this_looks_like_that.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/this_looks_like_that.png -------------------------------------------------------------------------------- /images/transformer_performance.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/transformer_performance.png -------------------------------------------------------------------------------- /images/uniform_convergence.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/uniform_convergence.png -------------------------------------------------------------------------------- /images/unsupervised_deformable_3D.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/unsupervised_deformable_3D.png -------------------------------------------------------------------------------- /images/unsupervised_deformable_3D_loss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/unsupervised_deformable_3D_loss.png -------------------------------------------------------------------------------- /images/unsupervised_deformable_3D_perceptual_loss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/unsupervised_deformable_3D_perceptual_loss.png -------------------------------------------------------------------------------- /images/unsupervised_deformable_3D_summary.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/unsupervised_deformable_3D_summary.png -------------------------------------------------------------------------------- /images/vgraph.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/vgraph.png -------------------------------------------------------------------------------- /images/visual_attention.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/visual_attention.png -------------------------------------------------------------------------------- /images/w2v-bert.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/w2v-bert.png -------------------------------------------------------------------------------- /images/yoto_method.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/vlgiitr/papers_we_read/dae679d1447ac2abd48e56cda55ffdbef4945f67/images/yoto_method.png -------------------------------------------------------------------------------- /summaries/Ablating Concepts in Text-to-Image 
Diffusion Models.md: -------------------------------------------------------------------------------- 1 | # Ablating Concepts in Text-to-Image Diffusion Models 2 | Nupur Kumari et al., **ICCV** **2023** 3 | 4 | ## Summary 5 | The paper introduces a method for modifying pretrained text-to-image diffusion models so that undesired concepts, such as copyrighted material, memorized images, or specific art styles, can be ablated (removed) while related concepts are preserved. 6 | 7 | ## Contributions 8 | 9 | 1. Addressing the challenge of efficiently preventing a diffusion model from generating specific concepts without retraining from scratch: the paper presents two strategies for aligning the target concept's distribution with an anchor concept's distribution. The method is highly efficient, taking only about five minutes of model updates per concept on an NVIDIA A100 GPU. 10 | 11 | 2. Ethical control in generative AI: a notable contribution is the introduction of a new avenue for controlling generative AI output ethically. By allowing specific concepts to be removed while related ones are preserved, the work enhances ethical governance over diffusion model outputs. 12 | 13 | ## Method 14 | 15 | The primary challenge addressed by the paper is efficiently preventing a diffusion model from generating specific concepts without retraining from scratch or losing related concepts. This is tackled by aligning the image distribution of the target concept (the one to be ablated) with the distribution of an anchor concept. 16 | 17 | They assume that the user provides the desired anchor concept, e.g., "cat" for "Grumpy Cat". The anchor concept overwrites the target concept and should be a superset of, or similar to, the target concept. Thus, given a set of text prompts $\{c^{*}\}$ describing the target concept, they aim to match the following two distributions via the Kullback–Leibler (KL) divergence: 18 | 19 | $$ \begin{aligned} 20 | \arg \min_{\hat{\Phi}} D_{\text{KL}}\left(p(x_{(0 \ldots T)}|c) \,\|\, p_{\hat{\Phi}}(x_{(0 \ldots T)}|c^{*})\right), 21 | \end{aligned} $$ 22 | 23 | where $p(x_{(0 \ldots T)}|c)$ is the target distribution over the $x_t, t \in [0, T]$, defined by the anchor concept $c$, and $p_{\hat{\Phi}}(x_{(0 \ldots T)}|c^{*})$ is the model's distribution for the target concept. A rough code sketch of this objective is given at the end of this summary. 24 | 25 | ## Results 26 | Extensive experiments validate the effectiveness of the proposed method across 16 ablation tasks, providing strong numerical evidence of its ability to remove target concepts, including specific objects, styles, and memorized images. 27 | ![image](https://github.com/shivank21/1y-recruitment-vlg/assets/128126577/523f4b03-ade3-4ecd-b09e-341f7130c86b) 28 | ![image](https://github.com/shivank21/1y-recruitment-vlg/assets/128126577/8e553897-8fdd-4f62-b72b-21eef9bd1fa9) 29 | 30 | ## Two Cents 31 | The paper marks a notable advancement in generative AI, enabling stronger control and ethical oversight over the outputs of diffusion models. It is noteworthy that in some instances the outputs for related concepts deteriorate after ablation, suggesting opportunities for further research in this area. 
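To make the objective above more concrete, here is a minimal sketch of one fine-tuning step. In diffusion models the KL objective reduces in practice to matching noise predictions, so the trainable model's prediction for the target prompt $c^{*}$ (e.g., "Grumpy Cat") is pushed toward a frozen copy's prediction for the anchor prompt $c$ (e.g., "cat"). All names below (`unet`, `noise_schedule`, etc.) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def ablation_step(unet, frozen_unet, text_encoder, x0_anchor,
                  target_prompt_ids, anchor_prompt_ids,
                  noise_schedule, optimizer):
    """One fine-tuning step pushing p(x | c*) toward p(x | c)."""
    b = x0_anchor.size(0)
    # Sample a diffusion timestep and noise the anchor-concept images.
    t = torch.randint(0, noise_schedule.num_steps, (b,), device=x0_anchor.device)
    noise = torch.randn_like(x0_anchor)
    x_t = noise_schedule.add_noise(x0_anchor, noise, t)

    with torch.no_grad():
        # The frozen model conditioned on the anchor prompt c provides the target.
        anchor_eps = frozen_unet(x_t, t, text_encoder(anchor_prompt_ids))

    # The trainable model conditioned on the target prompt c* is pulled toward
    # the anchor behaviour; the KL objective becomes an MSE on noise predictions.
    pred_eps = unet(x_t, t, text_encoder(target_prompt_ids))
    loss = F.mse_loss(pred_eps, anchor_eps)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Since only a distribution-matching loss is optimized over a handful of prompts, relatively few update steps are needed, which is consistent with the roughly five minutes of fine-tuning per concept reported above.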
32 | -------------------------------------------------------------------------------- /summaries/Adversarial_RL.md: -------------------------------------------------------------------------------- 1 | ## ADVERSARIAL POLICIES : ATTACKING DEEP REINFORCEMENT LEARNING 2 | 3 | Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, Stuart Russell, **ICLR 2020** 4 | 5 | ## Summary 6 | 7 | The discovery of adversarial examples for image classifiers prompted a new field of research into adversarial attacks and defenses. Recent work has shown that deep RL policies are also vulnerable to adversarial perturbations of image observations. However, real-world RL agents inhabit natural environments populated by other agents, including humans, who can only modify another agent’s observations via their actions. The authors explore whether it’s possible to attack a victim policy by building an adversarial policy that takes actions in a shared environment, inducing natural observations which have adversarial effects on the victim. 8 | 9 | In real world RL domains, an attacker cannot usually directly modify the victim policy’s input. As a proof of concept, we show the existence of adversarial policies in zero-sum simulated robotics games with [proprioceptive observations](https://arxiv.org/abs/1710.03748). The state-of-the-art victim policies were trained via self-play to be robust to opponents. We train each adversarial policy using model-free RL against a fixed black-box victim. We find the adversarial policies reliably beat their victim, despite training for less than 3% of the timesteps initially used to train the victim policies. Critically, we find the adversaries win by creating natural observations that are adversarial, and not by becoming generally strong opponents. Qualitatively, the adversaries fall to the ground in contorted positions, as illustrated in Figure 1, rather than learning to run, kick or block like normal opponents. This strategy does not work when the victim is ‘masked’ and cannot see the adversary’s position, suggesting that the adversary succeeds by manipulating a victim’s observations through its actions. 10 | 11 | 12 | 13 | We find that victim policies in higher-dimensional Humanoid environments are substantially more vulnerable to adversarial policies than in lower-dimensional Ant environments. We find adversarial policies induce significantly different activations than normal opponents, and that the adversarial activations are typically more widely dispersed between timesteps than normal activations. 14 | 15 | A natural defense is to fine-tune the victim against the adversary. We find this protects against that particular adversary, but that repeating the attack method finds a new adversary the fine-tuned victim is vulnerable to. However, this new adversary differs qualitatively by physically interfering with the victim. This suggests repeated fine-tuning might provide protection against a range of adversaries. 16 | 17 | ## Main Contributions 18 | 19 | - This paper proposes a novel, physically realistic threat model for adversarial examples in RL. 20 | 21 | - This paper demonstrates the existence of adversarial policies in this threat model for several simulated robotics games. Our adversarial policies reliably beat the victim, despite training with less than 3% as many timesteps and generating seemingly random behavior. 22 | 23 | - This paper provides a detailed analysis of why the adversarial policies work. 
Also, it shows that they create natural observations that are adversarial to the victim and push the activations of the victim’s policy network off-distribution. Additionally, we find policies are easier to attack in high-dimensional environments. 24 | 25 | ## Our two cents 26 | 27 | - The paper provides an introduction to the adversarial perturbations that can affect the victim’s observations by using a physically realistic threat model without any direct modification. 28 | 29 | - Adversarial policies highlight the need to move beyond training techniques involving only self-play. 30 | 31 | - We are tempted to focus more on the safety of our RL policies, as even an observation like "the opponent just collapsing" can smash our hopes to the ground. 32 | 33 | ## Resources 34 | 35 | - [Physically Realistic Attacks on Deep Reinforcement Learning (Blog)](https://bair.berkeley.edu/blog/2020/03/27/attacks/) 36 | 37 | - [This AI Does Nothing In Games…And Still Wins! (Youtube)](https://www.youtube.com/watch?v=u5wtoH0_KuA) -------------------------------------------------------------------------------- /summaries/Attention_Is_All_You_Need.md: -------------------------------------------------------------------------------- 1 | # Attention Is All You Need 2 | 3 | Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, NIPS 2017 4 | 5 | ## Summary 6 | 7 | This paper introduced a neural network architecture called the Transformer, which revolutionized NLP tasks by outperforming the existing RNNs & CNNs based architecture.The most famous current models that are emerging in NLP (for example, GPT-3 or BERT) tasks consist of dozens of transformers or some of their variants. 8 | 9 | ## Contributions 10 | 11 | - The Transformer, eschewing recurrence (in which ,the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions) and instead relying entirely on an attention mechanism to draw global dependencies between input and output (reducing it to constant number of operations). Thereby, omitting memory constraints & with significantly more parallelization and little time to train, it outperformed the existing architecture. 12 | 13 | - The application of this architecture is not only limited to NLP, but similar transformer based architecture is being used in fields like Computer vision & Reinforcement learning. 14 | 15 | ## Method 16 | 17 | 18 | 19 | - The Transformers are attention based encoder-decoder type architecture. 20 | 21 | - **Encoding block** contains certain number of encoder layers, which further contains two sub-layers, Self attention layer & feed-forward network. Words in our input sequence are embedded and position encoded, each of them flows through each of the two layers of the encoder. 22 | - The Self attention layer helps the encoder to understand the dependence/relevance of a specific word being encoded with other words in the input sentence. 23 | 24 | 25 | - This attention calculation process is done in a Multi-Head fashion and the output is further concatenated. 26 | 27 | - Residual connection around each of the two sub-layers is provided, followed by layer normalization. Further the output is fed in the FFN (preceding layers are connected to the current layer, and its output is connected to the input of all the subsequent layers). 
28 | 29 | - **Decoding block** processes the output of the top encoder layer with the help of an additional layer, encoder-decoder layer which performs multi-head attention over the output of the encoder stack (helps the decoder focus on appropriate places in the input sequence). 30 | - Similar to encoding block, it is also composed of stacks of decoder layer, containing the two sub-layers along with encoder-decoder layer. 31 | 32 | - Further the output vector is converted into scores and then into probabilities. The cell with the highest probability is chosen, and the word associated with it is produced as the output (of that time step). 33 | 34 | ## Results 35 | 36 | 37 | 38 | ## Our Two Cents 39 | 40 | - The model relies on positional encodings to understand the order of tokens in a sequence. While this works well for many tasks, it might not capture long-range dependencies effectively. 41 | 42 | - This Transformer architecture is designed for text data. Adapting it to handle other modalities can be challenging. 43 | 44 | ## Resources 45 | 46 | - [Breakdown of the transformer architecture (Blog)](http://jalammar.github.io/illustrated-transformer/) 47 | - [YouTube video](https://youtu.be/4Bdc55j80l8?si=FyyDYl3CG3ViOHJx) 48 | 49 | 50 | 51 | 52 | -------------------------------------------------------------------------------- /summaries/BART_Denoising_Sequence_to_Sequence_Pretraining_for_Natural_Language_Generation_Translation_and_Comprehension.md: -------------------------------------------------------------------------------- 1 | # BART: Denoising Sequence-to-Sequence Pretraining for Natural Language Understanding and Generation 2 | 3 | **Authors:** Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, Luke Zettlemoyer 4 | 5 | **Conference:** ACL (Association for Computational Linguistics) 2020 6 | 7 | ## Summary 8 | BART, short for Bidirectional and Auto-Regressive Transformers, represents a groundbreaking development in the field of natural language processing. This paper introduces a versatile denoising autoencoder designed for pretraining sequence-to-sequence models. The primary objective of BART is to address the inherent challenges in NLP tasks, including text generation and comprehension. The model achieves this through a unique combination of bidirectional encoding and left-to-right autoregressive decoding within a Transformer-based architecture. 9 | 10 | ## Contributions 11 | - BART introduces an innovative denoising autoencoder framework that significantly advances the field of NLP. 12 | - The model's performance in text generation and comprehension tasks sets new benchmarks, with broad applicability across NLP applications. 13 | - BART's design enables fine-tuning on various NLP tasks, making it an indispensable tool for researchers and practitioners. 14 | 15 | ## Method 16 | BART's architecture is built upon the well-established Transformer neural network. This architecture encompasses both bidirectional encoding and left-to-right autoregressive decoding. During the pretraining phase, BART follows a two-stage process: text corruption and model learning for text reconstruction. Diverse text corruption techniques, including token masking, token deletion, text infilling, sentence permutation, and document rotation, are employed to improve the model's understanding of contextual information and text generation capabilities. 
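As a rough illustration of the corruption step described above, here is a small, self-contained Python sketch of two of the noising transforms (token masking and text infilling). It is a simplified approximation, not BART's actual preprocessing: real implementations operate on subword ids and handle special tokens, which is glossed over here.

```python
import random
import numpy as np

MASK = "<mask>"

def token_masking(tokens, p=0.15):
    # Independently replace each token with <mask> with probability p.
    return [MASK if random.random() < p else tok for tok in tokens]

def text_infilling(tokens, n_spans=2, lam=3):
    # Replace a few contiguous spans with a single <mask> each; BART samples
    # span lengths from a Poisson(lambda = 3). A zero-length span simply
    # inserts a <mask> token.
    tokens = list(tokens)
    for _ in range(n_spans):
        span_len = min(len(tokens), int(np.random.poisson(lam)))
        start = random.randrange(0, len(tokens) - span_len + 1)
        tokens[start:start + span_len] = [MASK]
    return tokens

if __name__ == "__main__":
    sentence = "the quick brown fox jumps over the lazy dog".split()
    print(token_masking(sentence))
    print(text_infilling(sentence))
```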
17 | 18 | Mathematical equations are an integral part of optimizing the negative log likelihood of the original document during pretraining. The fine-tuning phase encompasses a range of NLP tasks, such as sequence classification, token classification, sequence generation, and machine translation, where BART adapts its learned representations to specific task requirements. 19 | 20 | ## Results 21 | BART outperforms existing models on a variety of NLP tasks, particularly excelling in text generation tasks, including abstractive summarization, dialogue response generation, and question answering. Notably, it achieves remarkable gains, with up to a 6-point improvement in ROUGE scores for summarization tasks and a 1.1 BLEU increase over back-translation systems for machine translation. Importantly, BART consistently delivers strong performance across the spectrum of NLP tasks. 22 | 23 | ## Comparisons with Baselines 24 | In direct comparisons with various pretraining objectives, including conventional language models, permuted language models, and masked language models, BART consistently demonstrates its strength and adaptability across different tasks. It showcases that its unique design significantly enhances its performance. 25 | 26 | ## Two-Cents 27 | BART's introduction is a landmark moment in NLP, as it successfully marries the advantages of bidirectional encoding with autoregressive decoding. Its adaptability for fine-tuning on a wide array of NLP tasks is a testament to its versatility. BART's remarkable performance in text generation tasks, including summarization and dialogue response, is a testament to its efficacy. While BART has raised the bar in NLP research, future endeavors might explore novel text corruption techniques and further optimization for specific applications. 28 | 29 | ## Resources 30 | - [BART Paper (PDF)](https://paperswithcode.com/paper/bart-denoising-sequence-to-sequence-pre) 31 | 32 | -------------------------------------------------------------------------------- /summaries/Big_Bird_Transformers.md: -------------------------------------------------------------------------------- 1 | # Big Bird: Transformers for Longer Sequences 2 | Manzil Zaheer, et al. , NeurIPS 2020 3 | 4 | ## Summary 5 | 6 | The paper presents BIGBIRD, a Transformer-based model designed to overcome the quadratic dependency on sequence length characteristic of such models. BIGBIRD introduces a sparse attention mechanism, reducing this dependency to linear. The model maintains all the theoretical properties of a full Transformer, while improving the ability to handle longer sequences due to its elaborate design, thus improving performance on various NLP tasks and expanding its applicability to DNA sequence representation. 7 | 8 | ## Contributions 9 | 10 | 1. Theoretical Advancements in Transformer Model: The BigBird model satisfies all known theoretical properties of a full transformer. Particularly, the addition of extra tokens allows it to express all continuous sequence-to-sequence functions with only O(n) inner products. Furthermore, under standard precision assumptions, BigBird proves to be Turing complete. 11 | 12 | 2. Empirical Benefits Proven on Various NLP tasks: BigBird has been demonstrated to benefit a variety of NLP tasks due to its modeling of extended context. It has delivered state-of-the-art results on different datasets for tasks such as question answering and document summarization, displaying an impressive empirical improvement over prior methods. 13 | 14 | 3. 
Innovative Application in Genomics: A pioneering application of attention-based models is introduced where long contexts are beneficial. BigBird is utilized for extracting contextual representations of genomics sequences such as DNA. With extensive masked Language Model (LM) pretraining, BigBird has shown improved performance on downstream tasks like promoter-region and chromatin profile prediction, hinting at its promising applicability beyond the realm of traditional NLP tasks 15 | 16 | 17 | ## Method 18 | 19 | BIGBIRD uses a generalized attention mechanism in each layer of the Transformer operating on an input sequence. This sparse attention mechanism is implemented as a sparsification of a directed graph, where each query attends over r randomly chosen keys. Changing the attention mechanism to a sparse model introduces a cost, and any sufficiently sparse mechanism will require more layers. BIGBIRD resolves this complexity by introducing global tokens and utilizing a sliding window to capture local structures, effectively constructing a sparse random graph which has logarithmic shortest path between nodes. 20 | 21 | ![BIGBIRD](https://miro.medium.com/v2/resize:fit:750/format:webp/1*4iAjyRtn65NAP-Oxm_PwLg.png) 22 | 23 | ## Results 24 | 25 | BIGBIRD demonstrates significant improvement on various NLP tasks on different datasets. In pretraining, it outperforms models like RoBERTa and Longformer. In question answering and document classification tasks, especially on long documents, BIGBIRD's performance surpasses other models. In generative tasks like document summarization, it was found that modeling longer context leads to significant improvement. In genomics, the new model exhibits high performance in tasks related to DNA sequence interpretation, thereby demonstrating the model's broad applicability beyond NLP. 26 | 27 | ## Two Cents 28 | 29 | While it's acknowledged that sparse attention mechanisms do not serve as a universal replacement for dense attention mechanisms, and that there is a cost involved in switching to a sparse attention mechanism, the potential of the BigBird model in transforming the field of Natural Language Processing (NLP) is undeniable. Despite these trade-offs, the revolutionizing capability of BigBird, particularly in handling longer sequence lengths and its successful application across various NLP tasks, signify substantial progress and have the potential for significant, far-reaching impacts in NLP. 30 | 31 | ## Resources 32 | 33 | - Link to Paper : https://arxiv.org/pdf/2007.14062.pdf 34 | - Other resources and blogs : https://www.analyticsvidhya.com/blog/2022/11/an-introduction-to-bigbird/ 35 | https://towardsdatascience.com/understanding-bigbird-is-it-another-big-milestone-in-nlp-e7546b2c9643 36 | - Code : https://github.com/google-research/bigbird 37 | -------------------------------------------------------------------------------- /summaries/CICERO.md: -------------------------------------------------------------------------------- 1 | # Human-level play in the game of Diplomacy by combining language models with strategic reasoning 2 | 3 | Meta Fundamental AI Research Diplomacy Team (FAIR), Antin Bakhtun, Noam Brown, Emily Dinan **Science Journal 2022** 4 | 5 | ## Summary 6 | 7 | Unlike competitive games like Chess and Go, which computers have mastered through self-play, the game Diplomacy has been a different challenge because it involves major human elements 8 | like cooperation and understanding other players' motivations. 
CICERO is the first AI agent to achieve human-level performance in Diplomacy.
9 | CICERO uses a language model integrated with a strategy model by understanding conversations happening during the game and generating dialogue to achieve its objectives. 10 | 11 | ## Diplomacy 12 | In Diplomacy, seven players conduct private natural language negotiations to coordinate their actions to cooperate and compete. 13 | Its main distinctions from most board wargames are its negotiation phases. Social interaction and interpersonal skills make up an essential part of the play. 14 | The rules that simulate combat are strategic, abstract, and simple—not tactical, realistic, or complex—as this is a diplomatic simulation game, not a military one. 15 | 16 | ## Contributions 17 | 18 | - Diplomacy is unique because it requires building trust with others in an environment that encourages players not to trust anyone. 19 | - Cicero couples a **dialogue module** with a **strategic reasoning engine**. 20 | - Cicero uses the dialogue and state of the board as a starting point for a planning algorithm that uses RL-trained models to predict the possible human actions of each participant. 21 | 22 | ## Method 23 | 24 |
25 | - In addition to the classic RL principles, Cicero uses anchor policies governed by a Dialogue-conditional action model, which act as behaviour cloning policies.
26 | In the absence of these anchor policies, the model's actions become 'non-human': it could devise an optimal strategy that nonetheless 27 | would not fit a human play environment. 28 | - The model defines a message to have intent **z** if **z** is the most likely set of actions that the sender and the recipient will take. 29 | - The model uses the **piKL** algorithm to predict player policies.
30 | $$U_i(\pi_i, \pi_{-i}) = u_i(\pi_i, \pi_{-i}) - \lambda \, D_{\mathrm{KL}}(\pi_i \,\|\, \tau_i)$$ 31 |
32 | where $\pi_i$ is the policy of player $i$, $\pi_{-i}$ the joint policy of all players other than $i$, $\tau_i$ the anchor (behaviour-cloning) policy, and $\lambda$ the anchor strength.
33 | - The parameter λ is interesting in that it measures the "human-ness" of a policy: the model uses a high value of λ when predicting other players' moves 34 | and a lower value when choosing its own (a small numerical sketch of this trade-off is given after the Results below). 35 | 36 | ## Results 37 | 38 | - Across 40 games of an anonymous online Diplomacy league, Cicero achieved more than double the average score of the human players and ranked in the top 10% of participants who played more than one game. 39 | - Comparisons with the baselines are shown below.
40 |
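To ground the regularised utility from the Method section, here is a tiny, self-contained numerical sketch. It is not Cicero's planner: it merely scores a few candidate policies for one player with $U_i = u_i - \lambda D_{KL}(\pi_i \| \tau_i)$, using made-up expected utilities and a made-up anchor policy, to show how a larger λ pulls the chosen policy towards the human-like anchor.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    # KL divergence D_KL(p || q) between two discrete distributions.
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    return float(np.sum(p * np.log(p / q)))

def regularised_utility(pi, utilities, anchor, lam):
    # U_i(pi) = E_{a ~ pi}[u_i(a)] - lambda * D_KL(pi || tau_i)
    return float(np.dot(pi, utilities)) - lam * kl(pi, anchor)

# Toy example: three candidate actions with made-up expected utilities,
# and a behaviour-cloning anchor that mostly plays the second action.
utilities = np.array([1.0, 0.4, 0.1])
anchor    = np.array([0.1, 0.8, 0.1])
candidates = {
    "greedy":     np.array([1.0, 0.0, 0.0]),
    "anchor-ish": np.array([0.2, 0.7, 0.1]),
}

for lam in (0.0, 0.5, 5.0):
    best = max(candidates,
               key=lambda k: regularised_utility(candidates[k], utilities, anchor, lam))
    print(f"lambda={lam}: best candidate = {best}")
```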
41 | 42 | ## Two-Cents 43 | 44 | - The model is an excellent development in areas with a cooperative element, and the authors also say that Cicero is not about Diplomacy; it has been used as a benchmark. 45 | - The **dialogue module** and the **strategic reasoning engine** of the model are more disjoint than we would hope, and it would be interesting to see that improve in the future. 46 | - The λ parameter does show us that even in Diplomacy, the more strategic policies are better than "human-like" policies, which tend to include human emotions. 47 | 48 | ## Resources 49 | 50 | Paper: https://www.science.org/doi/10.1126/science.ade9097 51 |
Blog by Meta AI: https://ai.facebook.com/blog/cicero-ai-negotiates-persuades-and-cooperates-with-people/ 52 | -------------------------------------------------------------------------------- /summaries/CLIP.md: -------------------------------------------------------------------------------- 1 | # CLIP (Contrastive Language–Image Pre-training) 2 | 3 | Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever
4 | **ICML** **2021** 5 | 6 | ## Summary 7 | 8 | CLIP (Contrastive Language–Image Pre-training) is a pre-training method that uses the task of predicting text-image pairings to learn image representations from raw text. The paper shows that CLIP enables zero-shot transfer of the model to various tasks, without the need for dataset-specific training. CLIP is part of a group of papers revisiting learning visual representations from natural language supervision; this line of work uses more modern architectures like the Transformer and includes VirTex, ICMLM, and ConVIRT. 9 | 10 | ## Contributions 11 | 12 | CLIP was introduced in the paper titled "Learning Transferable Visual Models From Natural Language Supervision". The intuition is that, in order to solve the task of zero-shot transfer, CLIP models need to learn to recognize a wide variety of visual concepts in images and associate them with their names. As a result, CLIP models can then be applied to nearly arbitrary visual classification tasks. 13 | 14 | There is a gap between “benchmark performance” and “real performance”, and generally this gap occurs because the models “cheat” by only optimizing for performance on the benchmark, much like a student who passed an exam by studying only the questions on past years’ exams. In contrast, the CLIP model can be evaluated on benchmarks without having to train on their data, so it can’t “cheat” in this manner. 15 | 16 | ## Method 17 | 18 | CLIP is a joint image and text embedding model trained on 400 million image-text pairs in a self-supervised way. This means that it maps both text and images to the same embedding space. So, for example, an image of a dog and the sentence “an image of a dog” would end up having very similar embeddings and be close to each other in the vector space. 19 | 20 | 21 | It consists of a text encoder (which embeds the text), which is a standard Transformer, and an image encoder (which embeds the images), for which multiple architectures such as ResNet-50 are used. The other major component is the multi-modal embedding space: CLIP is trained (by training the image and text encoders jointly) to place positive image-text pairs close to each other in this space. 22 | There are many ways to define “close” in machine learning; arguably the most common is cosine similarity, which is what CLIP employs. 23 | 24 | ## Results 25 | 26 | CLIP proved much more successful than standard models: it obtains performance similar to other image models on the dataset it was trained on, but on other datasets with a different distribution it performs better than those models, showing stronger generalization and zero-shot transfer abilities. 27 | 28 | 29 | 30 | Because they learn a wide range of visual concepts directly from natural language, CLIP models are significantly more flexible and general than existing ImageNet models. Hence, the authors find that CLIP is able to perform many different tasks zero-shot. It removes the need to train a model again and again for each specific dataset, reducing the reliance on costly datasets and adding "out of the box" recognition abilities. 31 | 32 | But there are limitations as well: CLIP struggles on more abstract or systematic tasks, such as counting the number of objects in an image, and on more complex tasks, such as predicting how close the nearest car is in a photo.
On these tasks, zero-shot CLIP is only slightly better than random guessing. 33 | 34 | ## Our Two-Cents 35 | 36 | This paper was really important for handling vision-language tasks, as it brings images and text into the same embedding space, allowing images to acquire context and meaning through the word embeddings of natural language models.
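To make the cosine-similarity matching described in the Method section concrete, here is a small, hedged sketch of CLIP-style zero-shot classification. The `image_encoder`, `text_encoder`, and `tokenizer` arguments are placeholders for whatever pre-trained CLIP backbone is available (not a specific library's API); the core idea is just to embed the image and the class-name prompts, L2-normalise, and take a softmax over cosine similarities.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image, class_names, image_encoder, text_encoder, tokenizer,
                       template="a photo of a {}"):
    """Return class probabilities for one image via CLIP-style similarity.

    image_encoder / text_encoder / tokenizer are assumed callables from a
    pre-trained CLIP model; they are placeholders, not a specific API.
    """
    prompts = [template.format(name) for name in class_names]
    text_emb = text_encoder(tokenizer(prompts))          # (num_classes, d)
    img_emb = image_encoder(image.unsqueeze(0))          # (1, d)

    # Cosine similarity = dot product of L2-normalised embeddings.
    text_emb = F.normalize(text_emb, dim=-1)
    img_emb = F.normalize(img_emb, dim=-1)
    logits = 100.0 * img_emb @ text_emb.T                # scale similar to CLIP's learned logit scale
    return logits.softmax(dim=-1).squeeze(0)             # (num_classes,)
```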
They created their dataset by web scraping, using captions of images from the internet, which is not always a reliable source; the web carries many biases and a lot of misinformation, so there remains plenty of scope for improvement. Future work could build a more reliable dataset using human preferences, explore alternate methods along similar lines that add more dimensionality to the process, and also investigate interpretability to understand how the similarity actually comes about. 38 | 39 | ## Resources 40 | 41 | - Paper Link: https://arxiv.org/pdf/2103.00020 42 | - Blog: https://openai.com/index/clip/ -------------------------------------------------------------------------------- /summaries/DIRE.md: -------------------------------------------------------------------------------- 1 | 2 | # DIRE for Diffusion-Generated Image Detection 3 | Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, Houqiang Li **ICCV** **2023** 4 | 5 | ## Summary 6 | 7 | 8 | This paper seeks to build a detector that tells apart real images from diffusion-generated images by proposing a novel image representation called **DI**ffusion **R**econstruction **E**rror (DIRE), which measures the error between an input image and its reconstruction by a pre-trained diffusion model. The hypothesis behind DIRE is the observation that images produced by diffusion processes can be reconstructed more accurately by a pre-trained diffusion model than real images can. 9 | 10 | 11 | ## Contributions 12 | 13 | 14 | - Proposed a novel image representation called DIRE for detecting diffusion-generated images. 15 | - Set up a new dataset, DiffusionForensics (including images from three domains: LSUN-Bedroom, ImageNet and CelebA-HQ), generated by eleven different diffusion models, for benchmarking diffusion-generated image detectors. 16 | 17 | 18 | ## Method 19 | 20 | 21 | Given an input image $x_0$ to judge whether it is generated by a diffusion model, we take a pre-trained diffusion model and apply the DDIM inversion process to gradually add Gaussian noise to $x_0$. Then the DDIM generation process is employed to reconstruct the input image, producing a recovered version $x'_0$. The DIRE is then defined as: 22 | 23 | 24 | $$ 25 | \mathrm{DIRE}(x_{0}) = |x_{0} - x'_{0}| 26 | $$ 27 | 28 | **Illustration of the difference between a real sample and a generated sample** 29 | 30 | $p_g(x)$ represents the distribution of generated images while $p_r(x)$ represents the distribution of real images. $x_g$ and $x_r$ represent a generated sample and a real sample, respectively. Using the inversion and reconstruction process of DDIM, $x_g$ and $x_r$ become $x'_g$ and $x'_r$, respectively. 31 | 32 | As a sample $x_g$ from the generated distribution $p_g(x)$ and its reconstruction $x'_g$ belong to the same distribution, the DIRE value for $x_g$ is relatively low. Conversely, the reconstruction of a real image $x_r$ is likely to differ significantly from the original, resulting in a high amplitude in DIRE. 33 | 34 | 35 | Thus, for real and diffusion-generated images, we compute their DIRE representations and train a binary classifier to distinguish them using a binary cross-entropy loss.
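A minimal sketch of how DIRE could be computed, assuming access to a pre-trained diffusion model wrapped in two hypothetical helpers: `ddim_invert`, which maps an image to its DDIM latent noise, and `ddim_reconstruct`, which maps that noise back to an image. Neither helper name comes from the paper or a specific library; they stand in for the inversion and generation passes described above.

```python
import torch

@torch.no_grad()
def compute_dire(x0, ddim_invert, ddim_reconstruct, num_steps=20):
    """DIRE(x0) = |x0 - x0'| where x0' is the DDIM reconstruction of x0.

    x0: image tensor in [-1, 1], shape (B, C, H, W).
    ddim_invert / ddim_reconstruct: hypothetical wrappers around a
    pre-trained diffusion model's deterministic DDIM passes.
    """
    latent_noise = ddim_invert(x0, num_steps=num_steps)          # x0 -> x_T
    x0_rec = ddim_reconstruct(latent_noise, num_steps=num_steps)  # x_T -> x0'
    return (x0 - x0_rec).abs()                                    # per-pixel residual

# The resulting DIRE maps (for real and generated images) are then fed to an
# ordinary binary classifier trained with cross-entropy.
```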
36 | 37 | 38 | ## Results 39 | 40 | - DIRE with a binary classifier significantly outperformed existing classifiers including CNNDetection, GANDetection, SBI, PatchForensics, F3Net at detecting - 41 | * Diffusion generated bedroom images 42 | * Diffusion generated face images 43 | * Generated ImageNet images 44 | * GAN-generated bedroom images 45 | 46 | - The robustness of detectors is checked in two-class degradations, Gaussian blur and JPEG compression, DIRE gets a perfect performance without performance drop. 47 | 48 | - Other methods of input also checked against DIRE were RGB images, reconstructed images (REC), and the combination of RGB and DIRE (RGB&DIRE). Using just DIRE as input achieved significantly higher accuracy 49 | 50 | 51 | ## Two-Cents 52 | 53 | The proposed image representation DIRE contributes to a novel, accurate and robust detector, outperforming current SOTA detection models extensively. 54 | 55 | ## Resources 56 | 57 | - [Paper](https://arxiv.org/pdf/2303.09295.pdf) 58 | - [Implementation](https://github.com/ZhendongWang6/DIRE) -------------------------------------------------------------------------------- /summaries/DoodlerGAN summary.md: -------------------------------------------------------------------------------- 1 | # CREATIVE SKETCH GENERATION 2 | Songwei Ge, Vedanuj Goswami & C. Lawre1113ce Zitnick, Devi Parikh, ICLR 2021 3 | ### Summary 4 | In this paper DoodlerGAN is proposed, which is a part-based Generative Adversarial Network (GAN). SketchRNN was the early approach to this direction with a a sequence-to-sequence Variational Autoencoder (VAE) later approaches incorporated a convolutional encoder to capture the spatial layout of sketches. SketchGAN adopted a conditional GAN model as the backbone to generate the missing part. 5 | ### Main points of the paper 6 | - DoodlerGAN is a part-based GAN which draws any doodle part by part instead of drawing as whole. 7 | - DoodlerGAN instead of mimicking human drawings, it is used for sketches that are not seen in human drawings. 8 | - DoodlerGAN basically consists of 2 parts 9 | - Part Selector module 10 | - Part Generator module 11 | - Part Selector predicts which part to be drawn next whereas Part Generator draws the raster image of the part. 12 | - Part generator is a Conditional GAN based upon the styleGAN2 Architecture. It uses a 5-layer styleGAN2 generator with [512,256,128,64,32] output channels starting from 4x4x64 constant feature map. For encoding the input partial sketches, it uses a 5-layer CNN with [16, 32, 64, 128, 256] output channels. It downsample the intermediate feature map after each encoder layer and concatenate it channel-wise with the corresponding layers in the generator, giving it access to the hierarchical spatial structure of the input sketch. 13 | 14 | [![image](https://user-images.githubusercontent.com/76916164/127908303-f2b3def9-6397-4a5b-b70f-dc8af145f6c3.png)](https://github.com/Sandstorm831/papers_we_read/blob/master/images/DoodlerGAN%20architecture.png) 15 | 16 | 17 | 18 | - The input sketch is represented by stacking channels for each part the sketch while a separate channel for the entire sketch. 19 | - It follows the Discriminator architecture from the styleGAN2 architecture. Discriminator is given access to the input sketch and the corresponding part channels so that it can distinguish between real and fake parts corresponding to both the shape and position of the parts relative to other parts, whereas the input to the discriminator is an image with (no. 
of parts + 1) channels. 20 | - The part selector uses a similar architecture to the encoder, with just a linear layer added at the end. 21 | 22 | [![image](https://user-images.githubusercontent.com/76916164/127901684-109844f1-3f08-460e-8169-e21faa89bd00.png)](https://github.com/Sandstorm831/papers_we_read/blob/master/images/Output%20images.png) 23 | 24 | ### Conclusion 25 | Although sketches by DoodlerGAN are quite creative and diverse, it can surpass human drawings in the case of creative birds but not in the case of creative creatures. 26 | 27 | [![image](https://user-images.githubusercontent.com/76916164/127901759-187eeafb-0626-47e1-adda-42dfe79cb0c2.png)](https://github.com/Sandstorm831/papers_we_read/blob/master/images/Conclusion.png) 28 | 29 | This leaves room for improvement and scope for generating more complex creative sketches. 30 | 31 | -------------------------------------------------------------------------------- /summaries/FFA-Net.md: -------------------------------------------------------------------------------- 1 | # Feature Fusion Attention Network for Single Image Dehazing 2 | Xu Qin, Zhilin Wang, Yuanchao Bai, Xiaodong Xie, Huizhu Jia 3 | **AAAI 2020** 4 | 5 | 6 | 7 | ## Summary: 8 | This paper presents the Feature Fusion Attention Network (FFA-Net) for single image dehazing, a crucial task in computer vision applications. FFA-Net consists of three essential elements: a Feature Attention (FA) module that combines Channel Attention and Pixel Attention, a basic block structure that incorporates Local Residual Learning and Feature Attention, and multi-layer feature fusion that adaptively learns feature weights to improve information preservation. Quantitatively and qualitatively, the accuracy demonstrated by the experimental results surpasses that of previous methods. FFA-Net demonstrates a novel approach to information integration, with applications that extend beyond dehazing. 9 | 10 | 11 | ## Contributions : 12 | 13 | - Introduction of FFA-Net, a new single image dehazing technique. 14 | - Channel Attention and Pixel Attention integration within the Feature Attention module. 15 | - Utilization of a fundamental block structure to improve the concentration of information. 16 | - Adaptive learning of feature weights for multi-layer feature fusion. 17 | - Significant accuracy enhancements relative to existing methodologies. 18 | 19 | ## The method: 20 | The FFA-Net method consists of three fundamental components (a rough sketch of the Feature Attention module follows the Results list below): 21 | - Feature Attention (FA) Module: Combines Channel Attention and Pixel Attention mechanisms so that different channel features and pixels can be weighted unequally. 22 | - Basic Block Structure: Incorporates Local Residual Learning and Feature Attention to emphasize relevant information. 23 | - Multi-layer Feature Fusion: Learns adaptive feature weights, preserving shallow-layer information and transferring it to deeper layers for more effective fusion. 24 | 25 | ## Results: 26 | 27 | - Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) are used to evaluate performance. 28 | - Quantitatively and qualitatively, FFA-Net outperforms previous techniques. 29 | > On the SOTS indoor test dataset, the PSNR metric improved from 30.23 dB to 36.39 dB. 30 | 31 | - Enhanced precision and visual quality in dehazed images. 32 | - FFA-Net outperforms contemporary single image dehazing techniques. 33 | - PSNR and SSIM metric improvements relative to existing baselines. 34 | - Demonstration of qualitative superiority through visual comparisons.
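To make the Feature Attention module in the method section concrete, here is a rough PyTorch sketch of a channel-attention plus pixel-attention block in the spirit of FFA-Net. The layer sizes and reduction ratio are illustrative guesses, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # global spatial pooling
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(self.pool(x))                 # per-channel reweighting

class PixelAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(x)                            # per-pixel reweighting

class FeatureAttention(nn.Module):
    """Channel attention followed by pixel attention, as in FFA-Net's FA module."""
    def __init__(self, channels):
        super().__init__()
        self.ca, self.pa = ChannelAttention(channels), PixelAttention(channels)

    def forward(self, x):
        return self.pa(self.ca(x))
```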
35 | 36 | ## Two cents: 37 | > This paper provides a compelling solution to the difficult problem of single image dehazing. The innovative combination of Channel Attention, Pixel Attention, and feature fusion techniques in FFA-Net results in significant accuracy enhancements. In addition, the adaptability of FFA-Net's feature fusion mechanism allows for applications beyond dehazing, including denoising, deraining, and super-resolution. Future research could investigate the scalability and generalizability of FFA-Net in order to apply it to a broader spectrum of low-level vision tasks. 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | -------------------------------------------------------------------------------- /summaries/GANcraft.md: -------------------------------------------------------------------------------- 1 | # GANcraft: Unsupervised 3D Neural Rendering of Minecraft Worlds 2 | Zekun Hao, Arun Mallya, Serge Belongie, Ming-Yu Liu 3 | 4 | ## Summary 5 | 6 | 7 | 8 | This paper presents **GANcraft** a volumetric rendering (rendering of the 3D world as 2D images) based approach to model a 3D block world scene with semantic labels as a continuous volumetric function and render view consistent, photorealistic images. In the absence of the paired training data, an image-to-image translation model generates the pseudo ground truth labels for the corresponding photorealistic 3D world. 9 | 10 | ## Contributions 11 | 12 | 13 | - The novel task of world-to-world translation from 14 | 3D block world which can be intuitively constructed in 15 | Minecraft to a realistic 3D world, the 3D extension to a 2D 16 | image to image translation. 17 | 18 | - Framework to train neural renderers in the absence of ground 19 | truth data for rendering the realistic-looking world using pseudo 20 | ground truth labels. 21 | 22 | - Training neural rendering architecture with adversarial losses 23 | and conditioning on the style image, extending 3D neural rendering 24 | methods. 25 | 26 | ## Model 27 | 28 | 29 | 30 | - GANs can successfully map images from one domain to another without paired data but the images generated for mapping a 3D world are not view-consistent and there is a flickering problem. 31 | Neural rendering techniques solve this problem of view consistency but cannot handle the Minecraft block world and real-world domain gap. 32 | - The model takes as input a 3D block world for which a voxel bounded feature field is learned using an MLP model that takes as input the location code, semantic label, and a shared style code. Instead of a view-dependent color used in neural rendering techniques, the network outputs image features **C(r,z)**. Vertices of each voxel are assigned feature vectors which are shared by adjacent voxels ensuring that there are no inconsistencies in the output. 33 | - This feature image is passed into a CNN renderer which converts the per pixel feature map to an RGB image. 34 | The model is trained on the adversarial and perceptual losses for the generated image and reconstruction loss wrt the corresponding pseudo ground truth labels. 35 | 36 | ## Results 37 | 38 | 39 | 40 | The model is evaluated based on the FID, KID scores where GANcraft achieves values close to **SPADE** which is a photorealistic image generator and outperforms other baselines on temporal consistency metric based on human preference scores. 
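GANcraft renders per-pixel features rather than RGB by integrating the voxel-bounded feature field along camera rays, and only then maps the feature image to RGB with a CNN. Below is a generic, NeRF-style volume-rendering sketch of that accumulation step (densities to alpha weights to an accumulated feature); it is a simplified illustration, not GANcraft's actual renderer, and the ray-sampling scheme is left abstract.

```python
import torch

def render_features(densities, features, deltas):
    """Accumulate per-sample features along each ray.

    densities: (R, S)    non-negative volume densities at S samples per ray
    features:  (R, S, C) feature vectors predicted at the same samples
    deltas:    (R, S)    distances between consecutive samples
    Returns one accumulated C-dimensional feature per ray, shape (R, C).
    """
    alpha = 1.0 - torch.exp(-densities * deltas)              # opacity per sample
    # Transmittance: probability the ray reaches sample i unoccluded.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1),
        dim=1)[:, :-1]
    weights = alpha * trans                                    # (R, S)
    return (weights.unsqueeze(-1) * features).sum(dim=1)       # (R, C)
```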
41 | 42 | ## Our Two Cents 43 | - The model achieves state-of-the-art results in the world-to-world translation task in the absence of the ground truth photorealistic images for the segmentation labels of the 3D world. 44 | - There is still a blocky appearance to the output images because of the domain shift in the training images of the spade model and the projected images from the 3D block world. 45 | 46 | ## Resources 47 | Project Page: https://nvlabs.github.io/GANcraft/ 48 | -------------------------------------------------------------------------------- /summaries/GIRAFFE.md: -------------------------------------------------------------------------------- 1 | # GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields 2 | Michael Niemeyer, Andreas Geiger 3 | 4 | ## Summary 5 | 6 | 7 | The paper presents a **generative neural feature fields** based approach for representing scenes as compositional generative neural feature fields for the disentanglement of objects from the background as well as individual object shapes and appearances while learning from unstructured and unposed image collections. 8 | The model learns a disentangled representation of individual objects in a scene by using separate latent vectors for each object and allows rotation, translation, and change in the camera pose of objects. 9 | 10 | ## Contributions 11 | 12 | - A novel method for generating scenes in a controllable and photorealistic manner while training from unposed images. 13 | - Incorporating a compositional 3D scene representation directly into a generative model for more controllable image synthesis. 14 | 15 | ## Model 16 | 17 | 18 | 19 | - GIRAFFE builds on the [GRAF](https://arxiv.org/pdf/2007.02442) model, trained on unposed images to represent a scene as a neural feature field conditioned on two latent vectors, the **appearance code: zₐ** and the **shape code: zₛ** sampled randomly from a distribution. 20 | - A key limitation of GRAF is that the entire scene is represented by a single network without explicit control over individual scene objects, their pose, shape, and appearance. GIRAFFE extends the GRAF model architecture where for each object, separate generative neural feature fields are conditioned on randomly sampled shape and appearance latent codes. 21 | - The compositional feature field given by the volumetric density and feature vectors for the background and objects combined is volume rendered for a sampled **camera pose: ε** to obtain a feature image, upsampled by a CNN renderer to get the generated RGB image. 22 | 23 | ## Results 24 | 25 | 26 | Horizontal, vertical translations and rotations performed using shape and appearance latent vector interpolations show the disentanglement of the background and the cars. 27 | 28 | ## Our Two Cents 29 | - The paper presents a novel idea for controllable image synthesis by disentangling individual objects from the background and their shape, appearance, and pose without supervision. 30 | - The model scales well on adding multiple objects in a scene at test time despite being trained on data with a single object instance in an image. 
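GIRAFFE's compositing of several object fields and a background field reduces to a simple operator: densities add up, and features are averaged with density weights, before the combined field is volume rendered. A small sketch of that composition is below; the tensor shapes are assumptions for illustration, not the paper's code.

```python
import torch

def compose_fields(sigmas, feats, eps=1e-8):
    """Combine N per-entity neural feature fields at shared 3D sample points.

    sigmas: (N, P)    densities of N entities (objects + background) at P points
    feats:  (N, P, C) corresponding feature vectors
    Returns the scene density (P,) and density-weighted mean feature (P, C).
    """
    sigma = sigmas.sum(dim=0)                                   # densities add up
    weights = sigmas / (sigma.unsqueeze(0) + eps)               # (N, P)
    feat = (weights.unsqueeze(-1) * feats).sum(dim=0)           # (P, C)
    return sigma, feat
```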
31 | 32 | ## Resources 33 | Project Page: https://m-niemeyer.github.io/project-pages/giraffe/index.html 34 | -------------------------------------------------------------------------------- /summaries/Image_Hijacks.md: -------------------------------------------------------------------------------- 1 | # Image Hijacks: Adversarial Images can Control Generative Models at Runtime 2 | 3 | Luke Bailey,Euan Ong,Stuart Russel,Scott Emmons, **ICML 2024** 4 | 5 | ## Summary 6 | 7 | The paper investigates a security vulnerability in Vision-Language Models called "Image Hijacks." Image hijacks are adversarial images that, when fed into a VLM, can manipulate the model's behavior at inference time, forcing it to produce outputs desired by the attacker. The authors introduce a novel algorithm, "Behavior Matching" for crafting such image hijacks. They then explore various attack scenarios using this technique, demonstrating its effectiveness against LLaVA. Notably, these attacks are automated, require only subtle image perturbations, and can even mimic the effects of specific textual instructions. 8 | 9 | ## Contributions 10 | 11 | - Proposes "Behavior Matching", a general algorithm for training image hijacks. 12 | 13 | - Develops "Prompt Matching", a method derived from Behavior Matching to train hijacks that mimic the behavior induced by arbitrary text prompts. 14 | 15 | - Craft and evaluate four types of image hijacks: Specific string attack, Jailbreak attack, Leak context attack, and Disinformation attack. 16 | 17 | - Through "Ensembled Behavior Matching", demonstrate the potential for creating Tranferable Attacks(single image hijacks effective against multiple VLM models simultaneously). 18 | 19 | ## Method 20 | 21 | **Formalization:** 22 | 23 | The authors formalize a VLM as a function $M_\phi(x, ctx)$ that takes an image $x$ and a text context $ctx$ as input and outputs $out$ (a sequence of logits). 24 | 25 | **Target Behavior:** 26 | The desired behavior is defined as a function $B$ mapping input contexts to target logits sequences ($B: C \to \text{Logits}$). 27 | 28 | **Optimization:** 29 | The algorithm uses projected gradient descent to find an image hijack $\hat{x}$ that minimizes the cross-entropy loss between the VLM's output given the hijack and the target logits over a set of input contexts $C$. 30 | This approach ensures that the generated hijack generalizes across different user inputs, as it's optimized to produce the target behavior for a range of contexts. 31 | 32 | **Mathematical Representation:** 33 | 34 | The goal is to find: 35 | 36 | $$ 37 | \hat{x} = \underset{x \in \text{Image}}{\arg \min} \sum_{\text{ctx} \in C} \left[ L\left(M^\text{force}_\phi(x, \text{ctx}, B(\text{ctx})), B(\text{ctx})\right) \right] 38 | $$ 39 | 40 | where: 41 | 42 | - $\hat{x}$ is the image hijack. 43 | - $\text{Image}$ represents the space of possible images. 44 | - $C$ is a set of possible input contexts. 45 | - $L$ is the cross-entropy loss function. 46 | - $M^\text{force}_\phi(x, \text{ctx}, \text{target})$ represents a teacher-forced VLM, conditioned to return the logits corresponding to the input target. 47 | 48 | 49 | 50 | **Prompt Matching:** 51 | For attacks where defining the target behavior directly through a dataset is difficult (e.g., disinformation attacks), the authors introduce "Prompt Matching". This method leverages the fact that it's often easier to characterize the desired behavior using a specific text prompt. 
Instead of training on a dataset of (context, target_text) pairs, Prompt Matching trains on (context, target_logits) pairs, where the target logits are generated by the VLM when provided with the target prompt concatenated with the context. 52 | 53 | ## Results 54 | 55 | The authors systematically evaluate the effectiveness of their image hijacks in four attack scenarios: 56 | 57 | - **Specific String Attack:** 58 | The VLM is forced to output a specific string, potentially leading to phishing attacks. They achieve a 100% success rate with $\ell_\infty$-norm constraints as small as $4/255$ and with relatively small stationary patches (60x60 pixels). 59 | 60 | - **Leak Context Attack:** 61 | The VLM is tricked into leaking sensitive information from its context window within an API call. This attack achieves a success rate of up to 96%. 62 | 63 | - **Jailbreak Attack:** 64 | The VLM is manipulated to bypass its safety training and comply with harmful instructions, achieving a maximum success rate of 92%. 65 | 66 | - **Disinformation Attack:** 67 | The VLM is made to provide false information consistently, successfully influencing the model to answer questions as if a factual statement were altered. 68 | 69 | 70 | 71 | ## Our Two Cents 72 | 73 | The paper highlights a critical security vulnerability in VLMs and proposes a generalizable "Behavior Matching" algorithm for crafting image hijacks in white box settings, demonstrating its efficacy across various attack scenarios. Future work should focus on black-box attacks and more robust defense mechanisms. 74 | 75 | ## Resources 76 | 77 | - [Paper](https://arxiv.org/abs/2309.00236) 78 | -------------------------------------------------------------------------------- /summaries/Image_Steganography_ADF-IS.md: -------------------------------------------------------------------------------- 1 | # GAN-based image steganography for enhancing security via adversarial attack and pixel-wise deep fusion 2 | Chao Yuan, Hongxia Wang, Peisong He, Jie Luo, Bin Li 3 | 4 | **Springer Multimedia tools and applications** **2022** 5 | 6 | 7 | ## Summary 8 | This paper proposes an advanced image steganographic scheme using generative adversarial networks (GANs) which includes adversarial attack and pixel-wise deep fusion techniques to fool detection systems and effectively embed and recover hidden information, achieving higher security and better performance compared to CNN based steganalyzers. 9 | 10 | ## Contributions 11 | - This paper presents a novel method called ADF-IS that uses GANs for better security and efficiency of hiding information within images. The scheme is designed to embed secret information in a way that is hard to detect and visually imperceptible. 12 | 13 | - This introduces a technique called UAN which introduces small, targeted changes to images to make it harder and mislead steganalysers to identify hidden data. 14 | 15 | - It involves pixel-wise deep fusion strategy which involves embedding hidden information in complex image areas to maintain high visual quality and make the hidden data harder to detect. 16 | 17 | ## Methodology 18 | The architecture of the ADF-IS model is composed of four main modules: Attack, Encoder, Decoder, and Critic. The embedding process begins with the Encoder module, where a cover image and secret information​ are input to generate a stego mask. Then, a vector sampled from a normal distribution is input into the Attack module which utilizes a generative network UAN to produce an adversarial perturbation. 
This perturbation is combined with the stego mask and added to the cover image​, resulting in the creation of the stego image. 19 | 20 | 21 | 22 | The Critic module then evaluates this stego image by distinguishing it from the original cover image and providing a score based on the differences. Finally, the stego image is sent to the Decoder module, which recovers the secret information. The method incorporates multiple loss functions and the Wasserstein GAN strategy to improve stability and effectiveness. 23 | 24 | 25 | 26 | ## Results 27 | The proposed method (ADF-IS) achieved high visual quality for stego images. Metrics like PSNR, PSNR-HVS, and PSNR-HVSm values were consistently high, indicating good imperceptibility. The inclusion of the Critic module in adversarial training improved the stability and quality of training, resulting in better overall performance. 28 | 29 | 30 | 31 | ## Two-Cents 32 | - The integration of adversarial attacks with GAN-based steganography represents a significant advancement in the field, enhancing both security and imperceptibility. 33 | 34 | - Developing user-friendly tools and interfaces based on this method can help in practical adoption. 35 | 36 | ## Resources 37 | - Link to [Paper](https://link.springer.com/article/10.1007/s11042-021-11778-z) -------------------------------------------------------------------------------- /summaries/Multi Concept_Customization_of_Text_to_Image_Diffusion.md: -------------------------------------------------------------------------------- 1 | # Multi-Concept Customization of Text-to-Image Diffusion 2 | Nupur Kumari,Bingliang Zhang,Richard Zhang,Eli Shechtman & Jun-Yan Zhu, **CVPR 2023** 3 | 4 | ## Summary 5 | 6 | While text-to-image diffusion models generally perform well, they face challenges with specific or nuanced concepts due to limited training data. The paper suggests a novel approach, introducing a method called custom diffusion to fine-tune pre-trained models. This enables the integration of new concepts with minimal resources and data. 7 | 8 | ![teaser](https://github.com/shivank21/vlg-recruitment-1y/assets/128126577/54447d1c-fd57-4625-88f3-9b109ca9cc34) 9 | 10 | 11 | ## Contributions 12 | 13 | - Demonstrates a quick and computationally efficient fine-tuning process that enables models to generate images of new concepts in detail.(Takes ~6 mins with 2 A100 GPU's) 14 | - Introduces a method for the model to learn several new concepts at once and blend them together in generated images, making the model more useful for handling intricate scenes. 15 | - Shows better performance to other methods in empirical evaluations 16 | - Introduces a new dataset of 101 concepts for evaluating model customization methods along with text prompts for single-concept and multi-concept compositions 17 | 18 | ## Method 19 | 20 | Given a set of target images, the method first retrieves (generates) regularization images with similar captions as target images. The final training dataset is union of target and regularization images. During fine-tuning the method update the key and value projection matrices of the cross-attention blocks in the diffusion model with the standard diffusion training loss. 
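Since only the cross-attention key and value projections are updated during fine-tuning, the parameter selection can be expressed in a few lines. The sketch below assumes a diffusion UNet whose cross-attention projections carry `attn2` and `to_k`/`to_v` in their parameter names, as in common open-source implementations; that naming is an assumption, not the paper's code.

```python
import torch

def select_kv_parameters(unet):
    """Freeze everything except the cross-attention key/value projections."""
    kv_params = []
    for name, param in unet.named_parameters():
        # "attn2" with "to_k"/"to_v" is the usual naming for text-conditioned
        # cross-attention projections in open-source diffusion UNets (assumed).
        is_kv = "attn2" in name and ("to_k" in name or "to_v" in name)
        param.requires_grad_(is_kv)
        if is_kv:
            kv_params.append(param)
    return kv_params

# Fine-tune only W_k and W_v with the standard diffusion loss, e.g.:
# optimizer = torch.optim.AdamW(select_kv_parameters(unet), lr=1e-5)
```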
21 | ![methodology](https://github.com/shivank21/vlg-recruitment-1y/assets/128126577/4145d0a5-a5af-4ce5-9b04-2c617188488b) 22 | 23 | Cross-attention block modifies the latent features of the network according to the condition features, i.e., text features in the case of text-to-image diffusion models.Given text features $c \in \mathbb{R}^{s \times d}$ and latent image features $f \in \mathbb{R}^{ (h\times w) \times l}$, a single-head cross-attention operation consists of $Q = W^q \textbf{f}, K = W^{k} \textbf{c}, V = W^{v} \textbf{c}$, and a weighted sum over value features as: 24 | 25 | $$\begin{equation} 26 | \begin{aligned} 27 | \text{Attention}(Q, K, V) = \text{Softmax}\Big(\frac{QK^T}{\sqrt{d'}} \Big)V, \\ 28 | \end{aligned} 29 | \end{equation}$$ 30 | 31 | where $W^q$, $W^k$, and $W^v$ map the inputs to a query, key, and value feature, respectively, and $d'$ is the output dimension of key and query features. The latent feature is then updated with the attention block output. The task of fine-tuning aims at updating the mapping from given text to image distribution, and the text features are only input to $W^k$ and $W^v$ projection matrix in the cross-attention block.Therefore, the authors propose to only update $W^{k}$ and $W^{v}$ parameters of the diffusion model during the fine-tuning process. 32 | 33 | ## Results 34 | 35 | ### Single-Concept 36 | 37 | ![Tortoise](https://github.com/shivank21/vlg-recruitment-1y/assets/128126577/33a8f690-ab26-48c7-9e7b-ae6962a50cae) 38 | 39 | Where, $V^∗$ is initialized with different rarely-occurring tokens and optimized along with cross-attention key and value 40 | matrices for each layer during fine tuning. 41 | 42 | ### Multi-Concept 43 | 44 | ![Chair_Table](https://github.com/shivank21/vlg-recruitment-1y/assets/128126577/8a93e8c6-d05f-431c-af8f-2a54ff258f27) 45 | 46 | ## Two-Cents 47 | 48 | The research takes a big step forward in tailoring models for turning text into images, but it also points to various possibilities ahead. One might consider expanding this approach to cover a broader range of creative tasks, like generating videos or even creating audio based on text descriptions.However, there are a couple of limitations. Tricky combinations, like having both a pet dog and a pet cat, still pose a challenge. Additionally, putting together three or more concepts becomes a difficult task with this method. 49 | 50 | ## Resources 51 | - Webpage : https://www.cs.cmu.edu/~custom-diffusion/ 52 | - Paper https://arxiv.org/pdf/2212.04488.pdf 53 | 54 | 55 | -------------------------------------------------------------------------------- /summaries/NMT_Gap.md: -------------------------------------------------------------------------------- 1 | # Bridging the Gap between Training and Inference for Neural Machine Translation 2 | 3 | Wen Zhang, Yang Feng, Fandong Meng, Di You, Qun Liu 4 | 5 | ## Summary 6 | 7 | This paper proposes a method to improve the task of Neural Machine Translation by sampling for context words.....predicted words along with ground truth words, choosing the predicted words on either a word-based level or at a sentence level. **This paper also got the best Long paper award at ACL 2019** 8 | 9 | ## Model and Concepts 10 | 11 | - The paper works on a problem called exposure bias which means that while training the model gets context from the ground truth words ...while during inference it has to predict everything by itself, This creates a gap between the distribution predicted words during training and inference. 
12 | 13 | - The main motive of the paper is to fix the overcorrection phenomenon: if the model predicts a word different from the ground truth, the cross-entropy loss forces it to match the ground truth at the next step. This is problematic because a sentence can have multiple valid translations, so strictly following the ground truth can cause overcorrection. 14 | 15 | - To fix this problem, during training the context words are sampled from either the ground truths or the model's own predictions, so that the training conditions start to match those at inference. 16 | 17 | - The words sampled from the predictions are called oracle words. There are two methods for obtaining oracles: the word-level oracle and the sentence-level oracle. 18 | 19 | - Initially, the context words are chosen from the ground truths to give the model some supervision and help it converge; over time the context words are chosen from the oracle words more often, exposing the model to the situations encountered during inference. A per-epoch decaying sampling rate controls this schedule (see the sketch after this summary). 20 | 21 | - For the word-level oracle, they take the predicted logits of the previous step, add Gumbel noise to them (as a form of regularization), apply softmax, and take the highest-probability word. 22 | 23 | - For the sentence-level oracle, they perform beam search using the predicted probabilities, restrict the generated length to that of the ground-truth sequence, and select the candidate with the maximum BLEU score as the oracle sentence; its words then serve as oracles for the respective steps. 24 | 25 | - Gumbel noise can also be added during the beam search to further improve performance. 26 | 27 | 28 | 29 | 30 | 31 | ## Main Conclusions and Results 32 | 33 | - Overcorrection Recovery (this model) outperforms other recent approaches that address exposure bias. 34 | - Adding Gumbel noise improves performance for both sentence- and word-level oracles, as it serves as a regularizer. 35 | - The sentence-level oracle performs better than the word-level oracle. 36 | 37 | - The model takes more time to converge than other models. The authors argue this is because the other models overfit, while their model does not and hence needs more time to converge. 38 | 39 | - The model works much better than other models on long and super-long sentences, possibly because the sentence-level oracle helps the model handle problems caused by sentence length. 40 | 41 | - They conducted an experiment to validate that the increase in BLEU score really comes from addressing exposure bias: they counted the ground-truth words whose probabilities under their model are greater than under the baseline model. This fraction came out quite high, around 65%, supporting the claim. 42 | 43 | ## Our Two Cents 44 | 45 | - The method introduced is quite versatile, as it can be readily applied to other models as well, so it feels like an important contribution to the research community. 46 | 47 | - One weakness is that the introduction of beam search would lead to a considerable increase in training time, about which the paper makes no comment; also, constraining the beam search to match the ground-truth length goes against the authors' own intuition that a sentence can have multiple possible translations.
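A small sketch of the word-level oracle selection and the decaying probability of feeding the ground-truth word, as described above. The exact decay schedule and noise temperature used in the paper may differ; the inverse-sigmoid-style decay and the Gumbel perturbation here are illustrative choices, not a faithful reimplementation.

```python
import math
import random
import torch

def gumbel_like(logits, eta=1e-12):
    # Standard Gumbel(0, 1) noise, used to perturb the previous-step logits.
    u = torch.rand_like(logits)
    return -torch.log(-torch.log(u + eta) + eta)

def word_level_oracle(prev_logits, tau=1.0):
    # Perturb logits with Gumbel noise, softmax, then take the arg-max word id.
    noisy = (prev_logits + gumbel_like(prev_logits)) / tau
    return noisy.softmax(dim=-1).argmax(dim=-1)

def ground_truth_prob(epoch, mu=12.0):
    # Probability of feeding the ground-truth word; it decays with the epoch
    # so the model sees more of its own (oracle) predictions over time.
    return mu / (mu + math.exp(epoch / mu))

def next_context_word(gt_word, prev_logits, epoch):
    p = ground_truth_prob(epoch)
    return gt_word if random.random() < p else word_level_oracle(prev_logits)
```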
48 | 49 | 50 | 51 | -------------------------------------------------------------------------------- /summaries/RLHF_Diversity.md: -------------------------------------------------------------------------------- 1 | # Understanding the Effects of RLHF on LLM Generalisation and Diversity 2 | 3 | Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette and Roberta Raileanu, **ICLR 2024** 4 | 5 | ## Summary 6 | 7 | The paper explores the impact of Reinforcement Learning from Human Feedback (RLHF) on Large Language Models in terms of two crucial aspects: their ability to generalize to new, unseen data (OOD generalization) and the diversity of their generated outputs, compared to Supervised Fine-Tuning (SFT) and Best-of-N sampling (BoN). 8 | 9 | ## Results and Analysis 10 | 11 | The paper focuses mainly on the trade-offs between model generalisation (how well an LLM adapts to new, unseen data distributions) and output diversity, defined as the range of different outputs the model can generate. 12 | 13 | ### Generalisation: 14 | 15 | - RLHF is shown to enhance both in-distribution (ID) and out-of-distribution (OOD) performance compared to SFT. This is notably observed in instruction-following tasks with larger distribution shifts. 16 | 17 | - When evaluating summarisation models, RLHF maintains superior performance in comparison to SFT across diverse test datasets. BoN notably outperforms RLHF in summarisation, although BoN incurs significantly higher inference costs. 18 | 19 | ### Diversity: 20 | 21 | - RLHF significantly reduces per-input diversity, which could pose challenges in scenarios where diverse responses are crucial. However, across-input diversity is only slightly affected, indicating that while RLHF limits variation for individual prompts, it preserves some level of diversity across a range of inputs. This phenomenon is analogous to the "mode collapse" problem often observed in reinforcement learning applications, where the model favors certain patterns at the expense of variety. 22 | 23 | ### Effect of KL-Penalty: 24 | 25 | - While RLHF enhances both in-distribution (ID) and out-of-distribution (OOD) performance, it significantly reduces output diversity compared to SFT. Given that the KL penalty coefficient is designed to keep the RLHF policy closer to the SFT policy, the authors explored whether adjusting this coefficient could balance generalization and diversity. However, the findings indicate that this approach is ineffective: increasing the KL penalty coefficient reduces both performance and per-input diversity, contrary to what is expected. This highlights the need for further research into more advanced methods to achieve a better balance between generalization and diversity.
26 | 27 | **RLHF Equation:** 28 | $$ R(x, y) = RM_{\theta_{RM}}(x, y) - \beta_{KL} D_{KL}(\pi_{\theta_{RL}}(y|x) \| \pi_{\theta_{SFT}}(y|x)) $$ 29 | where: 30 | - RM: Reward Model 31 | - $\theta_{RL}, \theta_{RM}, \theta_{SFT}$: parameters of the RLHF, RM, and SFT models, respectively 32 | - $x$: input 33 | - $y$: output 34 | - $\beta_{KL}$: weight of the KL penalty 35 | 36 | **Per-Input Diversity:** 37 | $$ \text{PerInputDiversity}_D(\pi) = \frac{1}{N} \sum_{i=1}^{N} D(O_i^\pi) $$ 38 | - $D$: diversity metric (e.g., n-gram diversity, Sentence-BERT cosine similarity, NLI diversity) 39 | - $\pi$: policy 40 | - $N$: number of sampled inputs 41 | - $O_i^\pi$: set of K outputs from policy $\pi$ for input $i$ 42 | 43 | **Across-Input Diversity:** 44 | $$ \text{AcrossInputDiversity}_D(\pi) = D\left(\bigcup_{i=1}^{N} O_i^\pi\right) $$ 45 | 46 | ### Our Two Cents 47 | 48 | The paper provides insights into RLHF, SFT, and BoN sampling, shedding light on the trade-offs between generalization and diversity in LLM fine-tuning. The trade-offs identified call for strategies that effectively balance these factors without significant compromises. Future research could explore hybrid methods or enhance RLHF with diversity-oriented adjustments. 49 | 50 | ### Resources 51 | 52 | - [Paper Link](https://arxiv.org/pdf/2310.06452) 53 | 54 | - [Code](https://github.com/facebookresearch/rlfh-gen-div) 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | -------------------------------------------------------------------------------- /summaries/RainbowMemory.md: -------------------------------------------------------------------------------- 1 | # Rainbow Memory: Continual Learning with a Memory of Diverse Samples 2 | 3 | Jihwan Bang, Heesu Kim, YoungJoon Yoo, Jung-Woo Ha, Jonghyun Choi, **CVPR** **2021** 4 | 5 | ## Summary 6 | 7 | This paper introduces a new memory management strategy called Rainbow Memory to improve Continual Learning, particularly Class Incremental Learning (CIL) with tasks that share classes (Blurry-CIL). It involves two steps. The first is ensuring that sampling from stored memory is diverse enough, where diversity is measured in terms of the classification uncertainty of a sample when distorted by various data augmentation methods. The second is ensuring sample diversity through Data Augmentation (DA), primarily mixed-label DA and automated DA. 8 | 9 | 10 | 11 | ## Contributions 12 | 13 | - Effective memory management strategy which allows storing old samples to alleviate catastrophic forgetting in CIL. 14 | - Diversity-aware sampling method for effectively managing memory with limited capacity by leveraging classification uncertainty. 15 | - Use of data augmentation techniques to enhance the diversity of samples, which also helps alleviate problems caused by the continuously changing class distribution of each task in a task stream. 16 | 17 | ## Method 18 | - Memory Update : 19 | - The authors measure uncertainty via the variance of the model output over versions of each sample slightly modified by data augmentation techniques like color jitter, shear and cutout. 20 | 21 | 22 | - The authors approximate the uncertainty as a Monte Carlo estimate of the distribution p(y=c|x), given the prior p($\tilde{x}$|x) over perturbed samples $\tilde{x}$. The perturbation prior is a uniform mixture of all perturbations.
23 | 24 | $$p(y=c|x) = \int_{\tilde{D}} p(\tilde{x}_t|x)p(y=c|\tilde{x}_t)d\tilde{x}_t$$ 25 | 26 | 27 | $$\approx \frac{1}{A}\sum_{t=1}^A p(y=c|\tilde{x}_{t})$$ 28 | 29 | - x, $\tilde{x}$, y, A, $\tilde{D}$ denote a sample, perturbed sample, label of sample, number of perturbation methods and data distribution defined by perturbed samples respectively. 30 | - The perturbed sample $\tilde{x}$ is drawn from a random function $f_r(.)$, where $\theta_r$ is a hyperparameter which denotes the random factor of the r-th perturbation, as: 31 | $$\tilde{x} =f_r(x|\theta_r), r=1,2,...,R $$ 32 | - The prior $p(\tilde{x}|x)$ is defined as: 33 | $$\tilde{x} \sim \sum_{r=1}^R w_r *f_r(x|\theta_r) $$ 34 | where the random variable $w_r$, r={1,...,R} is drawn from a categorical binary distribution. 35 | 36 | - If u(x) is the uncertainty of the sample with respect to perturbation: 37 | $$u(x)=1 - \frac{1}{T} \max\limits_c S_c $$ 38 | 39 | $$S_c = \sum_{t=1}^T 1_c arg\max_{\tilde{c}} p(y=\tilde{c}|\tilde{x}_t) $$ 40 | 41 | $S_c$ is the number of times class c is predicted as the most likely class, and $1_c$ denotes the binary class indexing vector. 42 | - So if there are k memory slots, and our stored samples from the previous task and fresh samples from the current task add up to m samples, they are first arranged according to u(x), then a sample at an interval of $\frac{m}{k}$ is picked thus forming a new set of stored samples for the next task. 43 | 44 | - Data Augmentation(DA) : 45 | - In mixed-label DA, images of new tasks and old tasks and their labels are mixed in the same proportion; this helps alleviate side effects caused by the change of class distribution over tasks and improves performance. 46 | - Automated DA: composes multiple DAs on model performance under CIL(Class Incremental Learning). 47 | 48 | ## Results 49 | 50 | - This method outperforms all other methods and baselines in blurry CIL and online(buffer unavailable) setups, which are more realistic and practical. 51 | - Even in Disjoint and offline(buffer available) setups, it either outperforms or shows comparable performance to other methods. 52 | - Performance gap decreases as memory size increases as the impact of effective sampling decreases. 53 | 54 | 55 | 56 | - Enhancement provided by CutMix and AutoAug is the most effective among all DAs. 57 | 58 | 59 | 60 | ## Two-Cents 61 | 62 | - This paper proposes a novel idea that helps deal with many realistic issues of Class Incremental Learning and helps make it more valuable in the real world. 63 | 64 | ## Resources 65 | 66 | - [ Paper on Rainbow Memory](https://arxiv.org/pdf/2103.17230.pdf) 67 | - [Code and Data Splits](https://github.com/clovaai/rainbow-memory) -------------------------------------------------------------------------------- /summaries/Semantically_multi-modal_image_synthesis.md: -------------------------------------------------------------------------------- 1 | # Semantically multi-modal image synthesis 2 | 3 | Zhen Zhu, Zhiliang Xu, Ansheng You, Xiang Bai, **CVPR 2020** 4 | 5 | ## Summary 6 | 7 | The paper focuses on *semantically multi-modal image synthesis(SMIS)* task, namely, generating multi-modal images at the semantic level. It proposes a novel network for semantically multi-modal synthesis task, called **GroupDNet (Group Decreasing Network)**. 
The network unconventionally adopts all group convolutions and modifies the group numbers of the convolutions to decrease in the decoder, considerably improving the training efficiency over other possible solutions like multiple generators. 8 | 9 | For each semantics, there is a specific controller. By adjusting the controller of a specific class, only the corresponding areas are changed accordingly. 10 | 11 | 12 | 13 | **Notation** 14 | 15 | - `M`: semantic segmentation mask 16 | - `C`: number of semantic classes in the dataset 17 | - `H`: image height 18 | - `W`: image width 19 | - `G`: generator 20 | - `Z`: latent code 21 | 22 | For conducting label-to-image translation,`G` requires `M` as conditional input to generate images. However, in order to support multi-modal generation, another input source to control generation diversity. Normally, an encoder is applied to extract `Z` as the controller. Upon receiving these two inputs, the image output `O` can be yielded through `O = G(Z, M )`. However, in the SMIS task, we aim to produce semantically diverse images by perturbing the class-specific latent code which independently controls the diversity of its corresponding class. 23 | 24 | For the SMIS task, the key is to divide `Z` into a series of class-specific latent codes each of which controls only a specific semantic class generation. The traditional convolutional encoder is not an optimal choice because the feature representations of all classes are internally entangled inside the latent code. This phenomenon inspired some architecture modifications in both the encoder and decoder to accomplish the task more effectively. 25 | 26 | ## GroupDNet 27 | 28 | - The main architecture of GroupDNet takes design inspirations from [SPADE](https://arxiv.org/abs/1903.07291). 29 | 30 | - A major modification of GroupDNet is the replacement of typical convolutions to group convolutions to achieve class-specific controllability. 31 | 32 | - GroupDNet contains one encoder and one decoder: 33 | - **Encoder**: Let *Mc* denote the binary mask for class c, *X ∈ RH×W* be the input image, then first the operation: *Xc = Mc . X*, performs feature disentanglement. The input to the encoder is the concatenation of all the *Xc* produced. All convolutions inside the encoder have C groups. The encoded Z is comprised of the class-specific Zc of all classes which serve as the controllers for their corresponding class *c* in the decoding phase. 34 | - **Decoder**: Following the general idea of using all group convolutions in the generator,the convolutional layers in SPADE generator module are replaced with group convolutions resulting in the new module **Conditional Group Normalization (CG-Norm)**. A Conditional Group Block(CG-Block) is made by dynamically merging CG-Norm and group convolutions. 35 | 36 | 37 | 38 | ## Main Contributions 39 | 40 | - **Appearance mixture**: We can gather the distinct styles of a person’s different body parts. Every combination of these styles presents a distinct person image, given a human parsing mask. In this way, we can create thousands of diverse and realistic person images given a person image gallery. 41 | 42 | - **Semantic manipulation**: For eg., inserting a bed in the room or replace the building with trees. 43 | 44 | - **Style morphing**: Feeding two real images to the encoder generates two style codes of these images. 
By extrapolating between these two codes, we can generate a sequence of images(which are very clear and meaningful) that progressively vary from image a to image b. 45 | 46 | 47 | 48 | ## Our two cents 49 | 50 | - Parameter sharing among classes greatly reduces the required computational power for image generation. 51 | 52 | - Generated images are highly defined because of efficient parameter sharing among classes. So a change in a feature of one class changes other classes accordingly which is really a great step ahead for better image generation. 53 | 54 | ## Resources 55 | 56 | - [Can We Make An Image Synthesis AI Controllable? (Youtube Video)](https://youtu.be/qk4cz0B5kK0) 57 | -------------------------------------------------------------------------------- /summaries/SiamMAE.md: -------------------------------------------------------------------------------- 1 | # Siamese Masked Autoencoders 2 | 3 | **Authors:** Agrim Gupta, Jiajun Wu, Jia Deng, Li Fei-Fei 4 |
5 | **NeurIPS 2023** 6 | 7 | ## Summary 8 | 9 | This paper introduces a neural network architecture called SiamMAE, a simple extension of Masked Autoencoders (MAE) for learning visual correspondence from videos. An asymmetric masking approach is explored, which encourages the network to model object motion, or in other words, to understand what *went where*. 10 | 11 | ## Contributions 12 | 13 | - An asymmetric masking approach is introduced; it creates a challenging self-supervised learning task while encouraging the network to learn temporal correlations (a very high ratio of patches is masked: 95% of the patches in frame f2 and 0% in frame f1). 14 | - It outperforms many networks on Video Object Segmentation, Video Part Segmentation, and Pose Tracking without the need for data augmentation, handcrafted tracking-based pretext tasks, or other techniques to prevent representational collapse. 15 | 16 | ## Methodology 17 | 18 | - They utilize self-supervised visual representation learning as a way to learn generalizable visual representations. Contrastive methods in the image domain encourage models to learn representations by modeling image similarity and dissimilarity, or only similarity. 19 | - Here, **Masked Autoencoders** are used, which are a type of denoising autoencoder that learns representations by reconstructing the original input from corrupted (i.e., masked) inputs. They have had a transformative impact on masked language modeling and on learning image representations. 20 | - **Siamese Networks**, which are weight-sharing neural networks used to compare entities, have been extensively used in modern contrastive learning approaches. 21 | 22 | #### Key Components 23 | - The Siamese encoders are used to process the two frames independently, and the asymmetric masking serves as an information bottleneck. Siamese networks have been used for correspondence learning and often require some information bottleneck to prevent the network from learning trivial solutions. 24 | - The output from the encoder is projected using a linear layer, and MASK tokens with position embeddings are added to generate the full set of tokens corresponding to the input frame. 25 | - To further increase the difficulty of the task, the two frames are sampled with a larger temporal gap, and only a small number of patches are given as input for the second frame, resulting in a challenging yet tractable self-supervised learning task (a toy sketch of this sampling and masking scheme is given after the quantitative results below). 26 | 27 | #### Working 28 | 29 | 30 | ## Results 31 | 32 | Here, the comparison against prior work shows better performance. 33 | 34 | 35 |
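As a rough illustration of the frame sampling and asymmetric masking described in the Methodology above (this is not the authors' implementation; the temporal-gap range, patch count, and masking ratio are assumed values), the core step could look like this:

```python
import torch

def sample_frame_pair(video, min_gap=4, max_gap=48):
    """Sample two frames (f1, f2) from a clip tensor [T, C, H, W] with a random temporal gap."""
    T = video.shape[0]
    gap = int(torch.randint(min_gap, max_gap + 1, (1,)))
    t1 = int(torch.randint(0, max(T - gap, 1), (1,)))
    return video[t1], video[min(t1 + gap, T - 1)]

def asymmetric_mask(num_patches, mask_ratio_f2=0.95):
    """Keep every patch token of f1 but only a small random subset of f2's patch tokens."""
    keep_f1 = torch.arange(num_patches)                      # frame 1: 0% masked
    n_keep = max(1, int(num_patches * (1 - mask_ratio_f2)))  # frame 2: ~5% of patches visible
    keep_f2 = torch.randperm(num_patches)[:n_keep]
    return keep_f1, keep_f2
```

Because almost nothing of the second frame is visible, the network is pushed to borrow the missing information from the first frame, which is what drives the correspondence learning described above.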
36 | General Qualitative Results are shown here concerning Object propagation, Pose propagation, and Semantic Propagation. 37 | 38 | 39 | ## Our Two Cents 40 | 41 | - The paper introduces how the asymmetric masking strategy is an effective strategy for learning correspondence, and it achieves these competitive results without the need for data augmentation, handcrafted tracking-based pretext tasks, or other techniques to prevent representational collapse. 42 | - Here, the network relies on learning correspondences by operating on pairs of video frames, so adapting it for multiple frames can be a challenge and looked into for its scalability. 43 | 44 | ## Resources 45 | 46 | Project Page: https://siam-mae-video.github.io/ 47 | -------------------------------------------------------------------------------- /summaries/Textual_inversion.md: -------------------------------------------------------------------------------- 1 | # An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion 2 | Rinon Gal1, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, Daniel Cohen-Or 3 | **ICVR** **2023** 4 | 5 | ## Summary 6 | The paper introduces a method which enables users a high level of creative freedom in generating images from natural language prompts by finding new words in the textual embedding space, represented by pseudo-words which helps in introducing new specific concepts into text-to-image models. 7 | 8 | 9 | 10 | ## Contributions 11 | 12 | - Addressing the challenge of large-scale text-to-image models which are constrained by user's ability to describe desired target through text, by introducing personalized text-to-image generation. 13 | 14 | - Introduction of 'Textual inversions' as a method to find pseudo words in embedding space for capturing semantics and visual details. 15 | 16 | - Address the challenges associated with fine-tuning or re-training models for each new concept while preserving model's existing capabilities. 17 | 18 | ## Methodology 19 | 20 | The primary challenge being addressed in the text is the difficulty of introducing new concepts into large-scale text-to-image models. The common challenges faced include the high cost of re-training models with expanded datasets for each new concept and the problem of catastrophic forgetting when fine-tuning on few examples. The proposed solution aims to tackle these challenges by introducing a method called "Textual Inversion" which works by transforming specific concepts into new made-up words. These made-up words can then be used to describe and create new scenes using simple sentences, making it easy for users to make changes. 21 | 22 | 23 | 24 | The approach looks at the word-embedding stage of the text encoders which focuses on visually reconstructing images ensuring that the generated pictures match the desired concepts more accurately. 25 | 26 | ## Results 27 | The method is tested on Latent Diffusion models and the extensive experiments shows the effectiveness of proposed approach in capturing and representing concepts using natural language, especially with a single word and offering good flexibility. 28 | 29 | 30 | 31 | ## Two-Cents 32 | 33 | - The paper provides more freedom in creating images from words. It's good at capturing the general idea of a concept, even though it might not be super precise with details. 34 | 35 | - This method makes it easier to edit pictures, but there's a risk that people might use it to create fake images, especially of private individuals. 
This suggests a need for more research in this area to understand and address these potential misuses. 36 | 37 | ## Resources 38 | - Link to [Paper](https://arxiv.org/pdf/2208.01618.pdf) and [Code](https://github.com/rinongal/textual_inversion) 39 | - [YouTube video](https://youtu.be/opD_H9bED9Y?si=dMgGO0d_2ClKDUY7) 40 | - [Blog](https://medium.com/@onkarmishra/how-textual-inversion-works-and-its-applications-5e3fda4aa0bc) 41 | 42 | -------------------------------------------------------------------------------- /summaries/ViLBERT.md: -------------------------------------------------------------------------------- 1 | # ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks 2 | 3 | Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee
4 | Georgia Institute of Technology, Oregon State University, Facebook AI Research 5 | 6 | ## Summary 7 | The paper proposes a model for learning task-agnostic joint representations of image content and natural language. It introduces a novel two-stream architecture with co-attentional transformer blocks that outperforms sensible ablations and exceeds the state of the art when transferred to multiple established vision-and-language tasks. 8 | 9 | 10 | 11 | ## Model 12 | The ViLBERT model consists of two parallel BERT-style streams, processing visual and textual inputs.
13 | Each stream is a series of standard encoder transformer blocks (TRM) and novel co-attentional transformer layers (Co-TRM), which are introduced to enable information exchange between the two modalities. 14 | 15 | 16 | 17 | ## Pre-train then transfer 18 | The model is pre-trained on the **Conceptual Captions** dataset on two proxy tasks:
19 | 20 | 1. masked multimodal modelling, and 21 | 2. multimodal alignment task. 22 | 23 | 24 | 25 | and then transfer the pretrained ViLBERT model to a set of four established vision-and-language tasks and one diagnostic task: 26 |
27 | - VQA (Visual Question Answering)
28 | - VCR (Visual Commonsense Reasoning)
29 | - Grounding Referring Expressions
30 | - Caption-Based Image Retrieval
31 | - ‘Zero-shot’ Caption-Based Image Retrieval
32 |
33 | Furthermore, transferring our model to these tasks is simple and easy to implement – requiring only the addition of a classifier for each task examined here. 34 | 35 | ## Implementation Details 36 | The linguistic stream of the ViLBERT model is initialized with pre-trained BERT. 37 | A pretrained Faster R-CNN (with a ResNet-101 backbone) is used to extract region features. Then, regions where the class detection probability exceeds a confidence threshold are selected, keeping between 10 and 36 high-scoring boxes. For each selected region i, v_i is defined as the mean-pooled convolutional feature from that region. 38 | 39 | ## Our Two Cents 40 | 41 | - The model sets state-of-the-art results on various visiolinguistic tasks, and yet the paper's explanation is extremely lucid and clear, which makes the paper an easy and intuitive read. 42 | - The paper explores joint representations for image and text, and reasoning between them. Although much progress has been made in visual and linguistic understanding, there is still much room for progress in relating these modalities. 43 | 44 | ## Implementation 45 | - https://github.com/jiasenlu/vilbert_beta -------------------------------------------------------------------------------- /summaries/Video_Representation_LLM.md: -------------------------------------------------------------------------------- 1 | # Learning Video Representations from Large Language Models 2 | 3 | Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar, Facebook AI Research - Meta AI, University of Texas, Austin 4 | **2022** 5 | ## Summary 6 | 7 | The approach proposed in this paper, LaViLa (Language-model augmented Video-Language), repurposes Large Language Models to be conditioned on visual input 8 | and then fine-tunes them to create automatic visual narrators that generate dense textual descriptions of videos. 9 | These generated descriptions are then used to contrastively train dual-encoder models with learned video-text embeddings, which outperform existing models. 10 | 11 | ## Contributions 12 | 13 | - Since the model uses captions generated by an LLM, the dual encoder can be trained effectively even with a very small fraction of ground-truth data. 14 | This is highly beneficial since the amount of readily available densely annotated video clips is very small. 15 | - It provides a very strong alignment between the visual input and the generated text. 16 | - It can also expand annotations when they are provided very sparsely, or when they do not capture all details of the activities occurring in the video frame. 17 | 18 | ## Method 19 |
20 | - The method LaViLa uses 2 LLM's: a NARRATOR and a REPHRASER for pseudo-text generation. These are pretrained on english words, and are fine-tuned on visual embeddings. 21 | - The Narrator architecture uses a frozen pre-trained LLM, and adds cross-attention modules having text tokens and the visual embeddings as input. 22 | - The Rephraser paraphrases the output generated by Narrator by replacing synonyms or changing word order etc.

23 |

24 | - The pseudo-captions generated by the Narrator and Rephraser, alongwith the ground truth video-text pairs are then used to train the DUAL ENCODER. 25 | It uses contrastive losses such as CLIP, InfoNCE for the same. 26 | 27 | ## Results 28 | LaViLa outperforms the previous state-of-the-art video-language pretraining methods on different datasets such as EK100 MIR, Charades-Ego, EGTEA etc.
29 | The evaluation is done through several protocols, and the approach outperforms the previous SOTA in all cases. 30 | - In *Zero-Shot* protocol, the model is applied to a new set of downstream validation datasets.
31 | [ The two different numbers are obtained on using two different number of frames as input (4-frame and 16-frame respectively) ]. 32 |
33 | - In *Fine-Tuned* protocol, the model is end-to-end fine-tuned on the training split of the target downstream datasets. 34 |
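To illustrate the CLIP/InfoNCE-style contrastive objective that the Method section above says is used to train the dual encoder, here is a minimal sketch; the batch construction, embedding dimensions, and temperature are placeholder assumptions, not LaViLa's actual modules or hyperparameters.

```python
import torch
import torch.nn.functional as F

def video_text_infonce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired video/text embeddings of shape [B, D]."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature          # [B, B] similarity matrix
    targets = torch.arange(video_emb.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)              # match each clip to its narration
    loss_t2v = F.cross_entropy(logits.t(), targets)          # and each narration to its clip
    return 0.5 * (loss_v2t + loss_t2v)
```

The pseudo-captions from the Narrator and Rephraser simply enter this loss as additional (clip, text) pairs alongside the ground-truth annotations.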
35 | 36 | ## Two-Cents 37 | 38 | - The existing state-of-the-art models for video-language representation lack efficient utilization of a very crucial knowledge base of pretrained LLM's, that include 39 | language intricacies such as conversational ability, factual knowledge, grammatical relations etc. which is well implemented in this approach. 40 | - Improvement of the output generated by Narrator for third person views by training on more relevant datasets and experimentation with different LLM's can be some future areas of work. 41 | 42 | ## Resources 43 | - Paper: https://facebookresearch.github.io/LaViLa/ 44 | - Implementation : https://colab.research.google.com/drive/1gHWiEWywIotRivYQTR-8NQ6GJC7sJUe4 45 | - Demo: https://huggingface.co/spaces/nateraw/lavila -------------------------------------------------------------------------------- /summaries/Vision_Transformer.md: -------------------------------------------------------------------------------- 1 | # An Image is Worth 16X16 Wrods: Transformers for Image Recognition at Scale 2 | 3 | Alexey Dosovitsky, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby **ICLR** **2021** 4 | 5 | ## Summary 6 | 7 | This paper explores the application of attention-based transformer architecture in computer vision. It divides images in patches, flattens them, adds positional embedding and feeds them into the transformer block. It is pretrained on the image classification task, and better performance at lesser compute is observed. 8 | 9 | ## Contributions 10 | 11 | - Successful application of the transformer architecture for vision-related tasks. 12 | - Improving on previous CNN-based SOTA with a fully attention-based architecture that takes lesser computational resources. 13 | 14 | 15 | 16 | ## Method 17 | 18 | - An image of dimensions CxHXW is divided into N patches of dimension CxPxP. This is reshaped into a tensor of dimension NxP^2C. 19 | - Positional embedding is added to all 1D tensors of dimension P^2C and fed into the encoder-only transformer block along with a learnable embedding. 20 | - An MLP classification head containing one hidden layer at pre-training and one layer during finetuning is attached to the learnable token. 21 | - It is pretrained on image classification tasks using datasets of varying sizes, including ImageNet and JFT 300M. 22 | - Layer norm is used before multi-head self-attention block and MLP block, unlike [ Vaswani et al. ](https://arxiv.org/abs/1706.03762), and MLP contains two layers with GELU non-linearity. 23 | - Finetuning is performed on higher resolution images, resulting in larger sequence length. Hence, we perform 2D interpolation of pre-trained positional embeddings. 24 | - Results on downstream tasks are reported either through few shot or finetuning accuracy. 25 | 26 | ## Results 27 | 28 | - ViT-L outperforms previous SOTA, i.e. Big Transfer (BiT) and Noisy Student with similar parameter count with much less compute. ViT-H further improves performance. 29 | - Larger ViT models perform worse than smaller ones if pretrained on smaller datasets despite moderate regularization, but as dataset size increases, larger models perform better. 30 | - ViT overfit more than ResNets of comparable computational cost on smaller datasets but performs better on larger dataset sizes, reinforcing the usefulness of convolutional inductive bias for smaller datasets. 
31 | - Hybrids (ViT on convolutional feature maps) perform slightly better than ViT on small computational budgets but the difference vansihes for larger models. 32 | - Learned positional embedding in the same row/column is similar, explaining why hand-crafted 2D aware embedding does not yield a performance boost. 33 | - Some attention heads in lower layers have global attention, while some have highly localized attention that might serve a purpose similar to early convolutional layers. 34 | 35 | 36 | ## Two-Cents 37 | 38 | The transformer-based architecture was very successful in NLP and, based on the results of this paper, performs exceptionally well with images, too, with a lower computational budget. A common architecture for handling different modalities of data is very promising. 39 | 40 | ## Resources 41 | - [Paper](https://arxiv.org/abs/2010.11929) -------------------------------------------------------------------------------- /summaries/W2V-BERT.md: -------------------------------------------------------------------------------- 1 | # w2v-BERT: Combining Contrastive Learning and Masked Language Modelling for Self-Supervised Speech Pre-Training 2 | 3 | Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, Yonghui Wu 4 | 5 | ## Abstract 6 | 7 | The Authors introduce a framework called w2v-BERT for self-supervised speech representation learning. The framework is based on self-supervised learning methods like Contrastive Learning and Masked Language Modelling. Unlike its predecessors, which concatenate separately trained modules, w2v-BERT is optimized using end-to-end training of all its sub-modules. The model is first pre-trained on unlabeled data, followed by fine-tuning on labelled data. 8 | 9 | ## Architecture 10 | 11 | The framework consists of three sub-modules: 12 | 13 | 1. **Feature Encoder** : The feature encoder acts as a convolutional subsampling block that consists of two 2D-convolution layers and generates speech representation feature vectors used by the contrastive module. 14 | 2. **Contrastive module** : This module discretizes the feature encoder output into a finite set of context vectors used by the Masked Prediction module. The feature vector outputs are fed into a linear layer after masking, followed by a stack of conformer blocks, to generate the context vectors. The masked vectors are just replaced with random vectors. The feature vectors are also fed into a quantizer without masking that generates target context vectors (quantized) and corresponding token IDs. These are then used by the Contrastive and Masked Prediction losses, respectively. 15 | 3. **Masked Prediction module** : This module is a stack of conformer blocks that converts the input context vectors into high-level contextualized speech representations. 16 | 17 | 18 |

19 | 20 |
*Figure: The w2v-BERT pre-training framework*

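As a rough sketch of the convolutional subsampling block described above for the feature encoder (the channel count, kernel sizes, and strides here are assumptions, not the paper's configuration):

```python
import torch.nn as nn

class ConvSubsampler(nn.Module):
    """Two 2D-conv layers that downsample a spectrogram in time before the contrastive module."""
    def __init__(self, out_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, out_dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(out_dim, out_dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, spec):                   # spec: [batch, time, mel_bins]
        x = self.conv(spec.unsqueeze(1))       # -> [batch, out_dim, time/4, mel_bins/4]
        b, c, t, f = x.shape
        return x.permute(0, 2, 1, 3).reshape(b, t, c * f)   # frame-level feature vectors
```

The two stride-2 convolutions reduce the frame rate by a factor of four before the masked and quantized branches of the contrastive module.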
21 | 22 | ## Pre-training Loss Functions 23 | 24 | 1. **Contrastive loss** : It simultaneously trains the contrastive module and the quantizer. The model identifies the true quantized vector for every context vector from a set of K distractor quantized vectors. This generates the contrastive loss along with the codebook diversity loss to encourage uniform usage of codes. 25 | 2. **Masked Prediction loss**: A SoftMax layer at the end of the model tries to predict the corresponding Token ID for every representation vector generated by the Masked Prediction module using cross-entropy loss. 26 | 27 | ## Fine-tuning 28 | 29 | The pre-trained model is further fine-tuned on labelled data using a decoder stack of Swish activation, Batch Norm and 2-layer LSTM. The authors also use self-training (pseudo-labelling), SpecAugment for data augmentation and Language Model (LM) fusion. 30 | 31 | ## Results and Discussion 32 | 33 | Some primary results of the paper are: 34 | 35 | 1. Without self-training and LM, w2v-BERT is already better than models with LM. 36 | 2. Models like w2v-Conformer that only use contrastive loss perform poorer than w2v-BERT, which also uses Masked Prediction loss. 37 | 3. The Contrastive module is necessary. Otherwise, the quantizer and Masked Prediction module can converge to a trivial solution of generating identical token IDs every time. 38 | 4. On fixing the total number of conformer blocks in the model and altering the number of blocks in the contrastive module, there is a performance sweet spot in between where both contrastive and masked prediction modules have optimal capacity for quantized representation learning. 39 | 40 | ## Two-cents 41 | 1. Methods like SimCLR, MoCo, MoCo v2, and BYOL have been proposed for contrastive learning. One could utilize these methods to gain further performance boosts. 42 | 2. Further developments in MLM, like ideas from RoBERTa, AlBERT, ELECTRA, and Llama, can also be used. 43 | 3. The decoder for fine-tuning can also be made using the latest transformer architectures rather than LSTMs. 44 | -------------------------------------------------------------------------------- /summaries/What_do_neural_networks_learn_in_image_classification-A_frequency_shortcut_perspective.md: -------------------------------------------------------------------------------- 1 | # What do Neural Networks Learn in Image Classification? A Frequency Shortcut Perspective 2 | 3 | Shunxin Wang , Raymond Veldhuis , Christoph Brune , Nicola Strisciuglio | University of Twente, The Netherlands | ICCV'23 4 | 5 | ## Summary 6 | 7 | This paper investigates the learning dynamics of NNs in classification tasks, focusing on learned frequency shortcuts and proposes a metric to measure class-wise frequency characteristics and a method to identify frequency shortcuts. 8 | 9 | 10 | 11 | To study the impact of data characteristics on the spectral bias of NNs,and frequency shortcut learning, authors generate four synthetic datasets {B1,B2,B3,B4}, each with a frequency bias in a different band, from low to high. Control frequency bands and special characterstic frequencies are introduced, demonstrating that the NNs can recognize samples of a given class when only part of the frequencies (shortcuts) associated with the special patterns are present in the test data.To summarize, the NNs trained on the four synthetic datasets use frequency differently, but they all adopt frequency shortcuts depending on the data characteristics. 
12 | 13 | The paper next introduces a metric called Accumulative Difference of Class-wise average Spectrum (ADCS), to compare the average frequency distributions of individual classes within a dataset. This facilitates the identification of discriminative and simple class-specific frequency characteristics to learn early in training. 14 | This metric considers that NNs are amplitude-dependent for classification and computes the average amplitude spectrum difference per channel for each class within a set $C=\{c_0, c_1, \ldots, c_n\}$ and average it into a one-channel ADCS. The ADCS for class ci at a frequency (u, v) is calculated as: 15 | 16 | 17 | 18 | is the average Fourier spectrum for class $c_i$, x is an image from the set $X_i$ of images contained in that class, and $\mathcal{F}_x(u, v)$ is its Fourier transform. 19 | $ADCS^{c_i}$ (u, v) ranges from 1 − |C| to |C| − 1. A higher value indicates that a certain class has more energy at a specific frequency than other classes. 20 | 21 | 22 | Examples of ADCS for two classes; Humming Bird & Zebra 23 | 24 | 25 | 26 | To identify frequency shortcuts, the authors propose a method based on culling irrelevant frequencies. The relevancy of each frequency to classification by recording the change in loss value when testing a model on images of a certain class with the concerned frequency removed from all channels. The increment in loss value is used as a score to rank the importance of frequencies for classification. 27 | 28 | On testing the influence of shortcuts on OOD generalization, authors found that frequency shortcuts can be transferred to another dataset. 29 | 30 | ## Contributions 31 | 32 | - The paper proposes a method to identify frequency shortcuts, based on culling frequencies that contribute less to classification. 33 | 34 | - The paper examines the influence of frequency shortcuts on the generalization of NNs, especially on Out Of Distribution(OOD) test sets. 35 | 36 | - The paper proposes a metric to measure class-wise frequency characteristics and a method to identify frequency shortcuts. 37 | 38 | ## Results 39 | 40 | - NNs learn frequency shortcuts during training to simplify classification tasks, driven by frequency characteristics of data and simplicity-bias. 41 | 42 | - Frequency shortcuts can be transferred o another dataset, in some cases, giving an illusion of improved generalization. 43 | 44 | - Larger model capacity and data augmentation techniques do not necessarily mitigate frequency shortcut learning. 45 | 46 | ## Two-Cents 47 | 48 | - The study expands previous works on the learning dynamics of NNs for regression tasks, broadens the understanding of frequency shortcuts (which can be either texture-based orshape-based), and provides a more systematic analysis of OOD generalization. 49 | 50 | - Enhancing the identification of frequency shortcuts and applying proper training schemes that avoid frequency shortcut learning may hold promise in improving generalization. 51 | 52 | ## Resources 53 | 54 | Paper:- https://arxiv.org/pdf/2307.09829.pdf 55 | -------------------------------------------------------------------------------- /summaries/binary_TTC.md: -------------------------------------------------------------------------------- 1 | # Binary TTC: A Temporal Geofence for Autonomous Navigation 2 | Abhishek Badki, Orazio Gallo, Jan Kautz, Pradeep Sen, CVPR 2021 3 | 4 | ## Summary 5 | This paper shows a technique to get pixelwise Time To Collision from two given images taken within a short duration of time. 
This paper proposes to use the 6 | relative size of an object in the latter image with respect to the initial image to get the TTC for that object and its corresponding pixels. 7 | 8 | ## Overview 9 | 10 | -It compares two images that are taken a short amount of time apart, then scales the later image with a scaling factor corresponding to a particular TTC 11 | so that the objects that would collide with the plane of camera would appear bigger with respect to the initial image and vice versa. 12 | -Each TTC would have a corresponding scaling factor with which we can scale the later image to determine which objects would collide on or before and after 13 | the specified TTC, hence dividing all the objects into two classes hence the name binary and with this we can create a binary tree with respect to each 14 | object's Time To Contact. 15 | -We can also use this approach to obtain a quantized TTC for each pixel by taking the scaling factors to the correspondingly quantized TTCs. 16 | 17 | ## Implementation Details 18 | 19 | -The architecture used has been shown as following: 20 | 21 | 22 | ## Strengths 23 | 24 | -Capable of providing TTC in a reasonable amount of time with just normal camera video feed of 150 frames per second. 25 | -The system is simpler than some others where people attempt to find relative position and velocity and then use them to calculate the TTC. 26 | -Unlike some other papers, it doesn't make over-simplifyng assumptions like constant brightness, objects being non-deformable etc. 27 | 28 | ## Weaknesses 29 | 30 | -We have to calculate scaling factors for given TTCs beforehand, it might have been better if the network itself could get the scaling factor as it might 31 | be the case that the scaling factors for same TTC can vary for different positions of camera and enviornments. 32 | 33 | ## Implementation 34 | [https://github.com/NVlabs/BiTTC](https://github.com/NVlabs/BiTTC) -------------------------------------------------------------------------------- /summaries/control_tasks.md: -------------------------------------------------------------------------------- 1 | # Designing and Interpreting Probes with Control Tasks 2 | 3 | John Hewitt, Percy Liang 4 | 5 | ## Summary 6 | 7 | This paper proposes control tasks which help to identify whether probes which are commonly used to predict the linguistic properties of word representations actually are useful in this ...or do they themselves learn how to perform well on linguistic tasks. **This paper also got the Best Paper Runner Up Award at EMNLP 2019** 8 | 9 | ## Main Idea - Control Tasks 10 | 11 | - Let V be the vocabulary containing all word types in a corpus, what happens in a control task is that we independently sample a **Control behaviour** `C(v)` for all v in V (**randomness**). This control behaviour deterministically determines the output y (**structure**), which belongs to the same space Y as the output of the Linguistic task. So all word tokens are assigned its word type output regardless of its context 12 | 13 | - The randomness in the choice on control behaviour makes the properties of the hidden representations useless and leave the probe all for itself.C(v) must be memorized independently for each word type, and a probe taking vectors h 1:T as input must identify for each hi its corresponding xi, and output the element of Y specified by C(xi). 
14 | 15 | - They define a new metric selectivity as ` selectivity = linguistic task accuracy - control task accuracy `, so a good probe will be one with high selectivity ie. it will have high linguistic task accuracy while having low control task accuracy ..so it means that the probe by itself can't do anything (low control task accuracy)...its actually some linguistic properties of the hidden representations that help in the linguistic task (high linguistic task accuracy). 16 | 17 | 18 | 19 | 20 | 21 | ## Main Contributions 22 | 23 | - They find that current probes are all over-parameterized cause even when they used rank-decomposition to critically reduce dimensions .. there is not much loss inaccuracy 24 | 25 | - They find that the most selective probes of those tested, even after careful complexity control, are linear or bilinear models. They also have the advantage that they exhibit high selectivity without the need to search over complexity control methods and also have similar accuracy. 26 | 27 | - Dropout seems to have actually hurt selectivity, while other regularization methods like L2 weight decay, constraining the number of training examples, constraining the hidden state dimensionality via rank-decomposition or critically constraining the hidden layer seems to have improved selectivity. 28 | 29 | - They find probes on ELMo2 to be strikingly more selective than those on ELMo1, consistent across all probes, this shows that probes use word identity as a feature to predict part-of-speech, and that feature is less easily available in ELMo2 than ELMo1, though it has a little bit less accuracy. 30 | 31 | - They introduce two control tasks 1.) Part-of-speech tagging control task and 2.)Dependency edge prediction control task, the details might be read from the paper itself. 32 | 33 | ## Our two cents 34 | 35 | - The new feature introduced ..namely, selectivity can be a new standard benchmark parameter in future NLP 36 | - It might lead to future research in improving the quality of inference of linguistic properties of hidden representations like ELMo and BERT 37 | 38 | ## Implementations and References 39 | 40 | - [Code](https://worksheets.codalab.org/worksheets/0xb0c351d6f1ac4c51b54f1023786bf6b2) 41 | - [Blog](https://nlp.stanford.edu/~johnhew/interpreting-probes.html) 42 | 43 | 44 | -------------------------------------------------------------------------------- /summaries/cycada.md: -------------------------------------------------------------------------------- 1 | 2 | # CyCADA: Cycle-Consistent Adversarial Domain Adaptation 3 | 4 | Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei A. Efros, Trevor Darrell, ICML-2018 5 | 6 | ## Summary 7 | 8 | This paper proposes a novel discriminatively-trained Cycle-Consistent Adversarial Domain Adaptation model. Leveraging the cycle-consistency, the model does not require aligned pairs and the author claims state-of-the-art results across multiple domain adaptation tasks. 9 | 10 | ## Main Contributions 11 | 12 | - The main contribution of the paper is its novel techique for domain adaptation using cycle-consistency losses, taking inspiration from the *CycleGAN* paper. But while CycleGAN produced task-agnostic domain transfer, this model has been trained for various particular tasks. 13 | 14 | - Provided is source data Xs, source labels Ys, target data Xt, but no target labels. The goal is to learn a model *f* to correctly predict label for target data Xt. 
15 | 16 | - What are the cases to be considered in the cases of basict things that we are -------------------------------------------------------------------------------- /summaries/cycle_dehaze.md: -------------------------------------------------------------------------------- 1 | # Cycle-Dehaze: Enhanced CycleGAN for Single Image Dehazing 2 | 3 | Deniz Engin, Anıl Genc, Hazım Kemal Ekenel 4 | 5 | ## Summary 6 | 7 | This paper proposes an end-to-end network for single-image dehazing problems, which is an enhanced version of the cycleGAN method, i.e, this also works on unpaired images in the datasets. The provided method is better than the state of the art methods for single-image dehazing which use the atmospheric scattering model. 8 | 9 | ## Main contributions 10 | 11 | - The addition of perceptual cycle loss in the original cycleGAN architecture, which compares images in a feature space rather than the pixel space. 12 | 13 | ![Losses](https://raw.githubusercontent.com/engindeniz/Cycle-Dehaze/master/figs/model.png) 14 | 15 | - Since the paper is adopting the original cycleGAN architecture, hence dataset of unpaired images is used. 16 | 17 | - This method does not rely on calculating the parameters of the atmospheric scattering model (which is used in the earlier methods). 18 | 19 | - To upscale the output images, Laplacian pyramid is used instead of the bicubic-upscaling method to get a higher resolution and sharp image for better haze free images. 20 | 21 | - Due to its cyclic structure, the model turns out to be generalized and can be used for other datasets. 22 | 23 | ## Implementation details 24 | 25 | - The CycleGAN model architecture is adapted from the original [CycleGAN](https://arxiv.org/pdf/1703.10593.pdf) paper, with the addition of a cycle perceptual loss inspired by the [EnhanceNet paper](https://arxiv.org/pdf/1612.07919.pdf). To achieve high-resolution output images, the model utilizes the [Laplacian pyramid](https://arxiv.org/pdf/1506.05751.pdf) method. 26 | 27 | ## Two-cents 28 | 29 | - The model achieves very significant results even in the absence of the ground-truth images, unlike the state of the art methods. 30 | 31 | - The is a generalizable model as it learns the dehazing task rather than overfitting on the data. 32 | 33 | - Higher weightage is given to cycle consistency loss as compared to perceptual cycle loss, as higher weightage to the perceptual cycle loss causes loss of color information after dehazing. 34 | 35 | ## Resources 36 | 37 | - Paper:- https://arxiv.org/pdf/1805.05308.pdf 38 | 39 | - Implementation:- https://github.com/engindeniz/Cycle-Dehaze/ -------------------------------------------------------------------------------- /summaries/cyclegan.md: -------------------------------------------------------------------------------- 1 | 2 | # Unpaired Image-to-Image Translation using Cycle Consistent Adversarial Networks 3 | 4 | Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros, ICCV-2017 5 | 6 | ## Summary 7 | 8 | This paper proposes an image-to-image translation mechanism without paired images from both the domains. Unkike its predecessor image translation techniques such as pix2pix it is able to do the task of image translation without paired labelled images by using cycle consistency loss. 9 | 10 | ## Main contributions 11 | 12 | - The main contribution of this paper is the **cycle consistency loss** which the authors have used successfully in various domain transfer problems. 
13 | 14 | - The basic idea behind cycle consistency is enfore an image to translate properly to a new domain even with the absence of paired images. To accomplish the task, the author rather than using a single generator and discriminator network has rather used two generators and two discriminators. 15 | 16 | - The first generator(G12) traslates the image from 1st domain to another while its corresponding discriminator(D1) acts a critic to differentiate from real and fake. The generated output from this is then passed through the second generator(G21) which then translates the image back to its original domain with its corresponding discriminator(D2). The same process is repeated but with passing the images from 2nd domain to G21 while forming a complete cycle again 17 | 18 | 19 | 20 | - However even with the above process the network can easily just translate an image to another domain without caring about the information that makes it unique since there are no paired images. For example, in a horse to zebra transformation, with the above process we can easily translate a horse to a zebra but we may just not retain the information such as how the zebra is standing among many others if we follow this process. 21 | 22 | - To rectify this, the authors introcuduced the *cycle consisteny loss* which is basically just comparing the initial image to the image obtained after passing it through the two networks(G12 and G21) and adding that also as a loss in the final loss equation. So we get two cycle consistency losses in our final equation. 23 | 24 | ## Implementation Details 25 | 26 | - They used the architecture from [Johnson et al.](https://github.com/jcjohnson/fast-neural-style) 27 | - They used [LSGAN](https://arxiv.org/abs/1611.04076)(Least squares GAN) instead of normal ones as it seemed to stabilize their training process. 28 | - To reduce model oscillation , they used [Shrivastava et al.’s strategy](https://arxiv.org/abs/1612.07828) 29 | 30 | ## Our two cents 31 | 32 | - The major contribution of the paper was undoubtedly its introduction of cycle consistency loss for unpaired image-to-image translation. The cycle loss has been used extensively in literature for various different tasks including segmentation and various other domain adaptation problems. Some of the experiments done by the community by using principles of this paper have even been underlined in the experiments section of the paper. 33 | 34 | - Although the technique has been able to achieve transformations of complex nature, it still has many limitations on the kind of transformations it can achieve, one of them being complex geometric transformation (eg. dog -> cat). 35 | 36 | - Also some failure cases are caused when applying transformation on the corner cases of examples obtained from training data, which indicates a lack of power by the network to generalize. 37 | 38 | ## Implementation 39 | 40 | - [https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix](https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix) 41 | -------------------------------------------------------------------------------- /summaries/densenet.md: -------------------------------------------------------------------------------- 1 | # Densely Connected Convolutional Networks (DenseNet) 2 | Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. 
Weinberger, CVPR-2017 3 | 4 | ## Summary 5 | This paper proposes a Densely Connected Convolutional Networks (DenseNet) which is an extension of the ResNet idea, in which every layer is connected to every other layer(hence the word Dense), and in which the combination is done via concatenation instead of simple summation. 6 | It has significantly fewer parameters than ResNet and also achieved SOTA performance in many object recognition tasks. 7 | **This paper also got the Best Paper Award in CVPR 2017** 8 | 9 | ## Overview 10 | 11 | - The main idea behind the paper is to connect every layer with every other layer in the network. This is done in a feed-forward fashion meaning that all the preceding layers are connected to the current layer, and its output is connected to the input of all the subsequent layers. 12 | 13 | 14 | - As the concatenation operation is valid for feature maps of same sizes only, the network consists of dense blocks which contain such connected layers followed by **transition layers** , which is basically a 1x1 Conv followed by 2x2 average pooling to reduce the size of feature maps. 15 | - Each layer adds k new feature maps to the network, this is called **growth rate**. 16 | - **Bottlenecks** which are 1x1 Conv that occur before 3x3 convs are present in denseblocks to improve computation 17 | - The number of feature maps from a denseblock is scaled m times (called **compression factor** ) via the transition layer. 18 | - Densenet with both bottlenecks and compression are called **Densenet-BC** 19 | 20 | ## Implementation Details 21 | 22 | - They used three Denseblocks that had an equal number of layers. 3x3 convs with zero padding of 1 was used to preserve feature map size. 1x1 convs followed with 2x2 average pooling for transition layers . 23 | - The feature-map sizes in the three dense blocks are 32×32, 16×16, and 8×8, respectively. 24 | - Trained with SGD using Nesterov Momentum without dampening along with dropout. 25 | 26 | ## Strengths 27 | - As every layer is connected there is a lot of feature reuse and hence a small number of feature maps per layer (small k) also leads to good performance which is a clear improvement over ResNets. So it has fewer parameters and comparable accuracy. 28 | - As a layer is closely connected with the output and input layers, so they kind of receive direct supervision and hence it improves gradient and information flow. 29 | - They are also shown to have a regularizing nature and are less prone to over-fitting. 30 | 31 | ## Weaknesses 32 | - The regularisation nature of the network could have been explored better. 33 | - The proper justification of using concatenation over summation of different activations could be explored further with more comprehensive set of experiments. 34 | 35 | ## Implementation 36 | - [https://github.com/liuzhuang13/DenseNet](https://github.com/liuzhuang13/DenseNet) 37 | 38 | -------------------------------------------------------------------------------- /summaries/ecpe.md: -------------------------------------------------------------------------------- 1 | # Emotion-Cause Pair Extraction: A New Task to Emotion Analysis in Texts 2 | 3 | Rui Xia, Zixiang Ding 4 | 5 | ## Summary 6 | 7 | This paper introduces a new task called **emotion-cause pair extraction** (ECPE), which aims to find emotion-cause pairs, this is different from emotion-cause pair extraction(ECE) as it does not depend on emotion annotations, they also propose a novel architecture to solve this task. 
**This paper also got the Outstanding Paper Award at ACL-2019** 8 | 9 | ## The ECPE task 10 | 11 | - They work on improving the emotion clause - cause clause detection task, in which we identify clauses which express an emotion (emotion) and find the clauses that cause this emotion (cause clause). 12 | 13 | - They suggest that because the ECE task requires emotion annotations beforehand it is not practically applicable and also as emotion and cause clause are co-related .....annotating the emotion clause separately beforehand doesn't take any advantage of this co-relation .....they try to fix these two shortcomings through the ECPE task. 14 | 15 | - So the ECPE task basically has two tasks ..1.) We first do both emotions and cause clause extraction using multi-task learning model(**Individual Emotion and Cause Extraction**) ...which they propose two variants of ....2.) We pair all the emotion and clauses extracted in step 1 together ...and then try to identify which of them actually form an emotion-cause pair. (**Emotion-Cause Pairing and Filtering**) 16 | 17 | 18 | 19 | 20 | 21 | ## Model 22 | 23 | - So first comes Individual Emotion and Cause Extraction, they propose two variants for this task. 24 | 25 | - First is **Independent Multi-task Learning**, so what they do is basically they separate the document into many clauses ...and each clause has some words. So for each clause, they pass its words through a Bi-LSTM, then apply attention to the hidden states obtained to get a clause representaion `s`. 26 | 27 | - These clause representations `s` then go on to the upper layer which has two components, one for emotion extraction and other for clause extraction, both of them are Bi-LSTM, which predict whether each clause is an emotion (cause for the second component) clause given the `s` vectors as inputs. 28 | 29 | - We then use a weighted sum of the cross-entropy loss of the two tasks as our loss function. 30 | 31 | - But wait ...didn't we wanted to use the correlation between emotions and cause, so the second method is **Interactive Multi-task Learning**, this also has two variants 1.) Inter-EC in which we use emotion extraction to improve cause extraction, 2.) Inter-CE in which we use cause extraction to improve emotion extraction. Both of them are almost similar so the paper only introduces Inter-EC. 32 | 33 | - In Inter-EC the lower component which extracts `s` is same as before, the upper layer contains two components, The first one, a Bi-LSTM takes the `s` as input and predicts if its an emotion clause. We embed this label as a vector and pass this vector to the second component which uses this vector along with `s` for cause clause extraction using a Bi-LSTM. 34 | 35 | 36 | - Now that we the emotions and cause clauses, we do cartesian product to obtain all possible emotion and cause pairs. We then use the `s` of both of these clauses, along with a vector `v` representing the distance between the two, and pass it to a logistic regression model to detect if the pair is valid 37 | 38 | ## Main Results and Contributions 39 | 40 | - Both Inter-EC and Inter-CE get great improvement compared to the Independent model. 41 | 42 | - They find that the improvements in Inter-EC are mainly in the recall rate on the cause extraction task, which shows that the predictions of emotion extraction are helpful in cause extraction. 
43 | 44 | - Similarly, for Inter-CE the improvements are mainly in the precision score on the emotion extraction task, which shows that the predictions of cause extraction are helpful in emotion extraction. 45 | 46 | - They find that the improvements of Inter-EC on the cause extraction task are much more than the improvement of InterCE on the emotion extraction task, which can be due to the fact that cause extraction is a more challenging task. 47 | 48 | - They observe that their model also performs comparably to the other models which depend on emotion annotation on the ECE task, which means they introduced practicality to the task while maintaining performance. 49 | 50 | ## Our Two Cents 51 | 52 | - The paper's explanation is extremely lucid and clear ...which makes the paper an easy and intuitive read. 53 | - The paper openly admits that there is more scope of improvement in their models for the task of cause extraction and they also want to make the model a single step .....so the paper admits its own faults and seems to want to work on it in the future. 54 | 55 | 56 | ## Implementation and References 57 | 58 | - https://github.com/NUSTM/ECPE 59 | 60 | -------------------------------------------------------------------------------- /summaries/entity_linking.md: -------------------------------------------------------------------------------- 1 | # Zero-Shot Entity Linking by Reading Entity Descriptions 2 | 3 | Lajanugen Logeswaran, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, Jacob Devlin, Honglak Lee 4 | 5 | ## Summary 6 | 7 | This paper presents a new task called zero-shot entity linking where mentions must be linked to unseen entities without in-domain labelled data, they also propose an adaptive pre-training strategy called domain adaptive pre-training (DAP), to address the domain shift problem associated with linking 8 | unseen entities in a new domain. **This paper also got the Outstanding paper award at ACL-2019** 9 | 10 | 11 | ## Introduction 12 | 13 | - Previous work on entity linking typically use powerful resources such as a high-coverage alias table, structured data, and linking frequency statistics, and also focus on linking to general entity databases. The paper uses no such resources and also focuses to generalize on unseen specialized entities. 14 | 15 | - The paper just makes one weak assumption, that there exist **entity dictionaries** which contains the entities `e` and their text descriptions `d`. 16 | 17 | - Their goal is to build entity linking systems that can generalize to new domains and entity dictionaries, which they term worlds, each world contains its own set of distributions of mentions `m` , documents `u` 18 | 19 | 20 | - They construct a new dataset for the zero-shot entity linking task using wikias, since, in wikias, mentions and context have rich document context that can be exploited by reading comprehension approaches. 21 | 22 | - They assume that the target entity exists in the entity dictionary and leave NIL recognition or clustering (NIL recognition means predicting new entities, which are absent in the entity dictionary) 23 | 24 | 25 | 26 | 27 | ## Model 28 | 29 | - They adopt a two-step model for entity linking ...first is a fast candidate generation stage ..which uses BM25, a variant of TF-IDF to measure the similarity between mention string and candidate documents and takes the top-k entities for further tasks. 
30 | 31 | - The second step is ranking these entities, we basically compare the mention in context and candidate entity description using models strong in reading comprehension and natural language inference tasks ie. Transformers. 32 | 33 | - They use a variant called **Full Transformer** in which the mention in context `m` and candidate entity description `e` are concatenated and input to the model. Mention words are signalled by a special embedding vector that is added to the mention word embeddings. By jointly encoding the entity description and the mention in context with a Transformer, they can attend to each other at every layer. 34 | 35 | - They show that such deep cross-attention helps the model outperform the previous transformer based approaches namely **Pool-Transformer** and **Cand-Pool-Transformer** which didn't use cross-attention in a similar manner. 36 | 37 | 38 | ## Domain-adaptive pre-training (DAP) 39 | 40 | - They also pretrain the model in an unsupervised fashion on different sets of data to improve further downstream tasks. They review two previous approaches and also propose a new approach DAP. 41 | 42 | - First is **Task-adaptive pre-training** in which model is pre-trained on the source and target domain unlabeled data jointly with the goal of discovering features that generalize across domains. 43 | 44 | - Second is **Open-corpus pre-training** in which model is pre-trained on large corpora in hope to partially capture the target-domain distribution along with it. 45 | 46 | - The third is a new method suggested by them called **Domain-adaptive pre-training** (DAP) which only pretrains on the target-domain data. The intuition for DAP is that representational capacity is limited, so models should prioritize the quality of target domain representations above all else. 47 | 48 | - In all the three approaches the model is fine-tuned on the source-domain labelled data at the end .....they also combine these approaches together to further improve performance. 49 | 50 | 51 | ## Our two cents 52 | 53 | - The paper introduces a new task called zero-shot entity linking, and also introduces a new dataset for it ....thus they made way for new research to occur in this field and is thus research provoking. 54 | 55 | - They admit that there is much scope for improvement in the candidate generation step ....so it admits its own faults and hopes to work on them in the future. 56 | 57 | ## Implementation and References 58 | 59 | - https://github.com/lajanugen/zeshel 60 | 61 | 62 | -------------------------------------------------------------------------------- /summaries/feature_denoising.md: -------------------------------------------------------------------------------- 1 | 2 | # Feature Denoising for Improving Adversarial Robustness 3 | 4 | Cihang Xie, Yuxin Wu, Laurens van der Maaten, Alan Yuille, kaiming He, CVPR-2019 5 | 6 | ## Summary 7 | 8 | This paper proposes a new technique to achieve adversarial robustness for a Deep ConvNet by leveraging the fact that adversarial perturbation lead to noise in the features extracted by these networks. So their network contains noise blocks for denoising feature maps and hence making the network classify the input correctly. 9 | 10 | ## Main contributions 11 | 12 | - The main contribution of the paper is their feature denoising module that has been used to denoise the feature map obtained when passing a perturbed/normal image through the network. 
13 | 14 | - The concept that the authors leverage for adversarial robustness is that the activation maps of adversarial examples are very different from the ones obtained by the normal inputs in the sense that they are noisy or more precisely assign weights to parts of images not required for classification or our main purpose. 15 | 16 | - So to fix this, a denoising block is added to the network after one of the layers to get activations close to normal activations and hence make the decision process by the network accurate. 17 | 18 | 19 | 20 | - As can be seen from the figure, the major component of this block is the denoising operation which denoises the image while 1\*1 conv and the residual connection are mainly for feature combination. The parameters for the denoising block can be learned in an end-to-end fashion. 21 | 22 | - The various Denoising operations they have used in the paper are as follows: 23 | - **Non-Local Means**: Computes denoised feature map of an input feature map by taking a weighted mean of features in all spatial locations. The various ways to select the weighting function: 24 | - Gaussian Softmax: Kind of softmax-based self-attention computation of the features 25 | - Dot Product: Simple dot product between features to get the weighting function. Provides denoising operation with no extra parameters. 26 | - **Bilateral filter**: Computes denoised feature map of an input feature map by taking a weighted mean of features in a predefined neighborhood. The various ways to calculate the weighting function: 27 | - Gaussian Softmax and Dot product are same as before. 28 | - Mean filter: Computation of simple mean of the features in the neighborhood window defined. 29 | - Median filter: Computation of simple median of the features in the neighborhood window defined. 30 | 31 | ## Implementation Details 32 | 33 | - Usage of PGD attacker to get adversarial examples to train the denoising block. 34 | 35 | - Choosing their baselines as ResNet-101/152, they have added 4 denoising blocks to a ResNet, each added after last residual block of res2, res3, res4 and res5 respectively. 36 | 37 | ## Our two cents 38 | 39 | - The notion of introducing the denoising block in the network has leveraged one of the basic concepts of adversarial examples and has given pretty competitive results. 40 | 41 | - Also the fact that the denoising block can be learned in an end-to-end fashion means that the modified network can also be used for simple image training, with little to no change in accuracy, as has been stated by the authors as well. 42 | 43 | - However, the work can still be extended further with the usage of even more complex denoising blocks, with operations not just directly on the feature maps but some kind of higher level representations of those, trainable in an end-to-end fashion in the same way. 44 | 45 | ## Implementation 46 | 47 | - [https://github.com/facebookresearch/ImageNet-Adversarial-Training](https://github.com/facebookresearch/ImageNet-Adversarial-Training) -------------------------------------------------------------------------------- /summaries/florence.md: -------------------------------------------------------------------------------- 1 | # Do you know that Florence is packed with visitors? 
Evaluating state-of-the-art models of speaker commitment 2 | 3 | Nanjiang Jiang and Marie-Catherine de Marneffe 4 | 5 | ## Summary 6 | 7 | - This paper evaluates two state-of-the-art speaker commitment models on CommitmentBank, an English dataset of naturally occurring discourses. They propose that linguistically informed model outperforms an LSTM-based one, suggesting that linguistic knowledge is needed to capture such challenging naturalistic data. **This paper also got the Best Short Paper Award at ACL 2019** 8 | 9 | ## Main Contributions and Results 10 | 11 | - The paper focuses on the task of speaker commitment which hopes to determine how much is a speaker committed to an event, like does the speaker believe the event is true, false or is uncertain about it. 12 | 13 | - They perform the evaluation on CommitmentBank, which has linguistically diverse naturally occurring sentences, making the task even more difficult, whereas previous works focused on constructed or news-wire examples, which may simplify the task by failing to reflect the lexical and syntactic diversity of naturally occurring utterances. 14 | 15 | - Each item in the dataset consists of up to two context sentences and one target sentence, Participants judged whether or not the speaker is certain that the content of the complement in the target sentence is true. 16 | 17 | - They evaluated two models on this dataset 1.)Rule-based model - It’s a rule-based algo, which uses a top-down approach on a dependency tree and predicts speaker commitment. It is kind of based on the implicative nature of the predicates and also whether they are under the scope of some modifier to change the factuality signs. 18 | 19 | - Second is a neural-based approach, for which three models, a linear biLSTM, a dependency tree biLSTM, a hybrid model that ensembles the two. 20 | 21 | - They observed that the rule-based model outperforms the biLSTM models, but both of these SOA models don't perform very well on the CommitmentBank. 22 | 23 | - This shows that a linguistically-informed model scales more successfully to challenging naturalistic data 24 | 25 | - They also observed that the rule-based approaches predictions are clustered at +3(strong positive commitment) and -3(strong negative commitment), whereas that of the hybrid neural approaches are at +3 ....this means that these SOA models fail to capture the in-between cases very well. 26 | 27 | 28 | 29 | 30 | ## Our Two Cents 31 | 32 | - The paper could have proposed a new approach for the commitment bank dataset, instead of using the previous SOTA model. 33 | 34 | - The paper proposes that we need to add more linguistic features to the current models to improve performance ...and is thus research promoting. 35 | 36 | 37 | 38 | -------------------------------------------------------------------------------- /summaries/frequency_bias.md: -------------------------------------------------------------------------------- 1 | # On The Frequency Bias of Generative Models 2 | Katja Schwarz, 3 | Yiyi Liao, 4 | Andreas Geiger, 5 | NeurIPS 2021 6 | 7 | ## Summary 8 | The paper discusses the problem of a bias shown towards high frequencies in existing GAN models thus making it pretty straightforward to detect real and fake images using a simple classifier . Any image can be viewed in the frequency domain as well by taking a discrete 2D fourier transform of the image . We can view it in the reduced spectrum by taking the azimuthal average over the spectrum in normalized polar coordinates. 
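To make the reduced-spectrum computation above concrete, here is a minimal Python/NumPy sketch (the function name, bin count and grayscale-input assumption are ours, not the paper's code): take the 2D DFT of the image, shift the zero frequency to the centre, and azimuthally average the power over rings of normalized radius.

```python
import numpy as np

def reduced_spectrum(img, n_bins=100):
    """Azimuthally averaged (reduced) power spectrum of a grayscale image."""
    power = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    r = r / r.max()                                    # normalized polar radius in [0, 1]
    idx = np.minimum((r * n_bins).astype(int), n_bins - 1)
    sums = np.bincount(idx.ravel(), weights=power.ravel(), minlength=n_bins)
    counts = np.bincount(idx.ravel(), minlength=n_bins)
    return np.linspace(0, 1, n_bins, endpoint=False), sums / np.maximum(counts, 1)
```

The power-law tail fit and binary classifier described in the contributions below would then operate on the bins with normalized frequency above the threshold rc = 0.75.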
9 | 10 | 11 | 12 | 13 | ## Contributions 14 | * To detect the fake images, they fit a power law function to the tail of the reduced spectrum for frequencies above a given threshold rc=0.75 and train a binary classifier on the fit parameters of each spectrum as proposed by Dzanic et al. 15 | * The paper then investigates the causes of this bias by assessing the generator and discriminator architectures independently. It looks at how the various upsampling operations in the generator and downsampling in the discriminator might lead to the high-frequency artifacts 16 | 17 | 18 | 19 | ## Method 20 | * The paper first looks at the PGAN generator independently. The generator is trained by pairing images from a dataset with latent codes and using a pixel-wise MSE loss. The training dataset is a 100-image Toyset and CelebA is used for testing. 21 | * Upsampling operations like bilinear and nearest neighbour upsampling force the generator to produce low high-frequency content. On the other hand, zero insertion and reshaping produce checkerboard artifacts which can be reduced by the learnable parameters in the further layers by introducing an L-2 loss on the logarithm of the reduced spectrum, which is more sensitive to errors at low magnitudes. 22 | * The training signal of the PGAN discriminator is then analysed . The discriminator is trained by pairing 10 images with 10 labels and optimize 10 learnable tensors and discriminator weights using the GAN 2 player setting 23 | * The downsampling operations don't have a bias towards high frequencies in general but struggle to learn frequencies having low magnitude in the spectrum (which are high frequencies in natural images). 24 | * They finally test StyleGan2 and observe a peak at high frequencies in the generated images. 25 | 26 | 27 | 28 | 29 | 30 | ## Conclusion 31 | We find that while bilinear and nearest neighbour upsampling produce low magnitude high-frequency content , zero insertion and reshaping produce checkerboard artifacts in the reduced spectrum. The discriminator is not generally biased towards high frequencies but struggles with low magnitudes. The quality of the training signal of discriminator is worsened due to downsampling operations . The error is reduced with a spectral discriminator and training on wavelet space but a lot of improvement is still required. 32 | 33 | ## Two Cents 34 | * A lot of scope is still left in improving the characteristics of the reduced spectrum of generated images. More focus should be put on improving the architecture of the discriminator as the generator alone isn't the sole cause of such artifacts. 35 | * Other generative models like Stable Diffusion should also be tested for any such frequency bias 36 | 37 | ## Resources 38 | - https://arxiv.org/pdf/2111.02447 39 | 40 | 41 | 42 | 43 | -------------------------------------------------------------------------------- /summaries/graph_unet.md: -------------------------------------------------------------------------------- 1 | 2 | # Graph U-Nets 3 | 4 | Hongyang Gao, Shuiwang Ji, ICML-2019 5 | 6 | ## Summary 7 | 8 | This paper proposes a U-Net like architecture for graphical data and tries pretty good performance on node classification and graph classification tasks. Also for this task, they develop a novel pooling and unpooling techniques for graphical data, which is essential to get wider perspective during classification process, just like in the case of simple image data. 
9 | 10 | ## Main contributions 11 | 12 | - The main contribution of this paper is its novel pooling and unpooling technique which has been incorporated in their U-Net architecture(which is similar to as for the case of images) to obtain better results on graph classification problems. 13 | 14 | - The simple pooling operation for image data cannot possibly work in the case of graphical data as the operation is reliant on spatial information while there is no sense of spatial information in the case of graphs. 15 | 16 | 17 | 18 | - So for **gPool operation**, we take a projection of **X**(The information matrix for nodes) to make it a 1D vector and then take *top-k* nodes, where k is according to our need. Once the index of the *top-k* nodes are noted, the rows from **X** corresponding to them are kept while others are removed. So we get a reduced information matrix out of this process. 19 | 20 | - However to include the projection vector **p** in the training process so that it is also learned, we multiply the *sigmoid* of *top-k* nodes to the reduced information matrix found above to get the final reduced information matrix. Also to get a new Adjacency matrix we just pick the entries corresponding to the remaining nodes and hence obtain a reduced adjacency matrix for our reduced information matrix. 21 | 22 | - For **gUnpool operation**, we just reverse the above process by adding zeroes in the information matrix in the exact places which were removed in the *top-k* selection process. And also to get the corresponding adjacency matrix, we just simply inflate the reduced adjaceny matrix with the same values(i.e. connection information) that was removed in the gPool operation. 23 | 24 | - Once these operations are decided, the author moves forward with the U-Net architecture, which is composed of encoder and decoder parts. The encoder consists of Convolutions and gPool operations to reduce the graph size and get useful information, while decoder consists of Convolutions and gUnpool operations to get the graph to its original size and also classify its nodes from the information extracted by Encoder. Skip connections are used between Encoder and Decoder. 25 | 26 | ## Implementation Details 27 | 28 | - For whole of their experiments, they have modified the convolution operation by using **A+2I** rather than **A+I** (a pretty common practise that is followed before convolution), where A is adjacency matrix and I is the identity matrix. The authors argue that more emphasis on node being operated upon lead to better results. 29 | 30 | - **Graph Connectivity Augmentation using Graph Power** is a technique used by authors to deal with the possibility of isolated nodes because of pooling operation. To deal with this issue, the authors use *kth* power of the graph to increase graph connectivity. In their work they employ *k=2*. 31 | 32 | ## Our two cents 33 | 34 | - The novel technique for pooling operation introduced while leveraging the already deduced power of U-Net architecture undoubtedly is able to produce some pretty good results for graph classification problem. 35 | 36 | - To be able to properly generalize to image data, the usage of **Graph Connectivity Augmentation using Graph Power** is of paramount importance for this pooling operation as it leads to forming of a connection between the remaining nodes and not just the connections after the simple gPool operation. 
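For concreteness, here is a rough PyTorch-style sketch of the gPool/gUnpool operations described in the main contributions above (our own illustrative code, not the authors' implementation; the normalization of the projection vector is an assumption):

```python
import torch

def g_pool(A, X, p, k):
    """gPool: score nodes by projecting X onto p, keep the top-k of them.
    A: (N, N) adjacency, X: (N, C) node features, p: (C,) learnable projection."""
    scores = X @ p / p.norm()                  # 1-D projection of every node
    values, idx = torch.topk(scores, k)        # indices of the top-k nodes
    X_pool = X[idx] * torch.sigmoid(values).unsqueeze(-1)  # gating keeps p trainable
    A_pool = A[idx][:, idx]                    # adjacency restricted to the kept nodes
    return A_pool, X_pool, idx

def g_unpool(X_pool, idx, n_nodes):
    """gUnpool: place pooled features back at their original node positions,
    filling the removed positions with zeros (the saved adjacency is reused)."""
    X_up = X_pool.new_zeros(n_nodes, X_pool.size(1))
    X_up[idx] = X_pool
    return X_up
```

In the full architecture these would be interleaved with graph convolutions in the encoder and decoder, with the saved `idx` and adjacency reused on the decoder side via the skip connections.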
37 | 38 | ## Implementation 39 | 40 | - [https://github.com/HongyangGao/gunet](https://github.com/HongyangGao/gunet) -------------------------------------------------------------------------------- /summaries/imagen.md: -------------------------------------------------------------------------------- 1 | # Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding 2 | 3 | Google Research Brain Team(Ontario, Canada), Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, Mohammad Norouzi, May 2022 4 | 5 | ## Summary 6 | 7 | 8 | 9 | 10 | Among a lot of text-to-image models that surfaced in 2022, google also introduced its respective version named Imagen. The model has a typical text-to-image model pipeline comprising an encoder and then a diffusion model which is then followed by an arrangement to upsample the image from _64*64_ to _1024*1024_. One of the main findings of the research team was that using a large generic language model such as T5 or BERT pretrained only on text-only corpora is more effective than increasing the size of the diffusion model. The model has been trained on a combination of an internal dataset having 460M text-image pairs and the Laion dataset having 400M text-image pairs. The diffusion model has around 2B parameters!! 11 | 12 | ## Contributions 13 | 14 | - Large frozen Language models trained only on text data are very effective encoders for text-to-image generation and scaling their size improves sample quality significantly 15 | 16 | - Proposed Efficient U-Net, a new architectural variant that is simple, fast, and memory efficient. 17 | 18 | - Introduced dynamic thresholding, a new diffusion sampling technique generating more photorealistic images. 19 | 20 | - Achieved a new state-of-the-art COCO FID at 7.27 and challenged itself on a new challenging benchmark, DrawBench. 21 | 22 | ## Method 23 | 24 | The model uses T5(Text-to-text transfer transformer) as the encoder. The main intuition is that because of its sheer size it'll learn useful representations. For the 64*64 image diffusion model, U-Net architecture is employed. The network is then conditioned on text embeddings via a pooled embedding vector added to the diffusion timestamp. Further model is conditioned on the entire sequence of text embeddings by adding cross attention at multiple resolutions. Finally, super-resolution models adapted from some previous works with slight modifications are used to upscale the image. 25 | 26 | 27 | ## Results 28 | 29 | The model, when evaluated on the COCO validation set, achieves a zero-shot FID score at 7.27 outperforming DALL-E 2. The images generated by the model were preferred 0.395 times out of 1 to be more photorealistic than the original COCO dataset images when asked to people. 30 | 31 | Imagen is also compared on a newly introduced benchmark,*Drawbench*. Drawbench is a collection of 200 prompts that challenge model on different capabilities such as ability to faithfully render different colors, number of objects, spatial relations, text in the scene and many more. Human raters are then presented with images from 2 models and asked to compare the models on sample fidelity and image-text alignment. Results are shown in the image below. 32 | 33 |
34 | 35 | 36 | ## Two-Cents 37 | 38 | Imagen's results speak for themselves and mark another great success in the area of text-to-image generation and generative modeling. Imagen also adds to the list of the great accomplishments of Diffusion Models, which have taken the Machine Learning world by storm over the past few years with a string of absurdly impressive results. 39 | 40 | Another conclusion that can be drawn is that we need to keep scaling up models to achieve more fascinating results. 41 | 42 | ## Resources 43 | 44 | Project Site: https://imagen.research.google/ 45 | -------------------------------------------------------------------------------- /summaries/info_bottleneck.md: -------------------------------------------------------------------------------- 1 | # Specializing Word Embeddings (for Parsing) by Information Bottleneck 2 | Xiang Lisa Li, Jason Eisner 3 | 4 | ## Summary 5 | 6 | This paper proposes a method based on Variational Information Bottleneck to compress word embeddings like BERT and Elmo into a discrete or continuous version in a way that is very fast and also more accurate for parsing problems. **This paper also got the best paper award in EMNLP 2019** 7 | 8 | ## Main Contribution 9 | 10 | - The method compresses the embeddings by extracting just their syntactic properties—specifically, the information needed to reconstruct parse trees, unlike other embeddings which also focus on semantic meanings. 11 | 12 | - Their method, VIB, goes beyond mere dimensionality reduction, it also gives flexibility in the effective number of dimensions being used for each token by blurring unneeded complexity via randomness. 13 | 14 | - This method is complementary to previous fine-tuning approaches as it learns to exploit existing info found by BERT or Elmo instead of adding new info via fine-tuning. 15 | 16 | - The discrete representations are tuned explicitly for discriminative parsing, so they prove to be even more useful for summarizing the syntactic properties of a work token than the POS tags, even at the same level of granularity. 17 | 18 | - The continuous representations are also more useful than the uncompressed Elmo representations when it comes to generalizing to test data. 19 | 20 |

21 | 22 |

23 | 24 | ## Loss Function 25 | 26 | 27 | 28 | 29 | - The goal is to learn a stochastic map `p(t | x)` from X to some compressed representation T 30 | 31 | - The picture on the left is the theoretical loss function proposed, while the one on the right is the variational information bottleneck (VIB) loss function achieved after applying variational inference for various probability distributions in the theoretical loss. 32 | 33 | - **I(X; T)— the Token Encoder `p(t | x)`** - Here I is the mutual information ...so if this term is low ..then it means that the compression does not retain very much information about X ...as it can be seen in the right picture ..the exact calculation of this term is done via variational inference ..which also produces the KL Divergence term as the upper bound. 34 | - **I(Ti; X|Xˆi) — the Type Encoder s(ti|xˆi)** - which measures the amount of information about Ti given by the sentence X as a whole, beyond what is given by Xˆi (Elmo’s level-0 embedding of word i, a word type embedding that does not depend on context). This is done cause Elmo's embeddings extract info from other words in the sentence also ..but we want to depend on the current word only 35 | - It is also modelled via an upper bound using variational inference leading to another KLD term 36 | 37 | - **I(Y; T) — the Decoder q(y | t)** - This term favours the predictive accuracy, making I(Y; T) large tries to obtain a high log-probability p(y | t) for the true parse y when reconstructing it from t alone. Again modelled via variational inference to give a lower bound 38 | 39 | - Regarding beta and gamma as a Lagrange multiplier, we see that the goal of IB is to maximize the predictive power of T (I(Y; T)) subject to some constraint on the amount of information about X that T carries (I(X; T)) while depending on the current word only (I(Ti; X|Xˆi)) 40 | 41 | ## Architecture & Implementaion Details 42 | 43 | - For the token encoder, to obtain continuous tags, define `p(ti | xi)` such that ti is Gaussian-distributed, to obtain discrete tags follow a softmax distribution, x computed from the Elmo word vector xi via a feedforward neural network and no transfer function at the output layer. 44 | - For the type, encoder uses the same architecture as the token encoder, except that `p` takes a token vector as input whereas `s` takes a context-independent type vector. `s` is not used at test time, but only as part of our training objective. 45 | - For the decoder, they use the deep biaffine dependency parser and some other algorithms (Tutte’s matrix-tree theorem, directed spanning tree algorithm ) 46 | ## Our two cents 47 | 48 | - The network seems to be very computationally efficient ...being trained on a single GPU which makes it quite feasible 49 | - The tags generated seem to compare well with gold POS tags, showing clustering of similar tags together, they also conducted experiments that proved that semantic info is destroyed as compression is increased ...which is a good sign ....also the patterns revealed in the annealing of discrete tags show that the results are coherent. 50 | - Could be applied to other model embeddings by changing the model-specific -decoder ...makes the process quite versatile. 
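For reference, the three mutual-information terms discussed in the loss-function section combine, in our reading of the description above (with β and γ the Lagrange multipliers), into an objective of roughly the following form, which the variational (VIB) bounds then approximate:

$$ \min_{p(t \mid x)} \; -\,I(Y;T) \;+\; \beta \, I(X;T) \;+\; \gamma \sum_i I\big(T_i;\, X \mid \hat{X}_i\big) $$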
51 | 52 | 53 | -------------------------------------------------------------------------------- /summaries/infomax.md: -------------------------------------------------------------------------------- 1 | 2 | # Putting an End to End-to-End: Gradient-Isolated Learning of Representations 3 | 4 | Sindy Lowe, Peter O' Connor, Bastiaan S. Veeling, NIPS-2019 5 | 6 | ## Summary 7 | 8 | The paper proposes a novel self-supervised learning technique which rather than employing end-to-end training, focusses on employing isolated training of various modules of the network using greedy InfoNCE objective. This paper heavily inspires from biological phenomenons. This paper won **Honorable Mention Outstanding New Directions Paper Award** at NeurIPS 2019. 9 | 10 | ## Main Contributions 11 | 12 | - The main contribution of this paper is its greedy InfoNCE objective that is used for training the model in modules rather than end-to-end. This could lead to huge improvement in the computation time and also lead to overcoming the memory bottleneck. The model learns in a rather self-supervised manner by using mutual information between various patches as supervision for training of patches. 13 | 14 | - This technique has been highly inspired from various principles of neuroscience with the fact that rather than optimizing a single objective function, brain rather functions in a modular way and optimizes local information. 15 | 16 | - The self-supervised persona of the model comes from the fact that it tries to maximize mutual information of representations of temporally nearby data. This seems to work because of the presence of *slow features* in the data that are highly effective for downstream tasks. 17 | 18 | 19 | 20 | - As seen from the image, the work is focussed heavily on Contrastive Predictive Coding(CPC) as introduced in [Oord et al.](https://arxiv.org/abs/1807.03748). The basic principle behind CPC being maximizing mutual information between representations of temporally nearby patches. 21 | 22 | - Elucidating on the work in CPC, a summarization of encoding until time t, *ct* is taken by employing an autoregressive model over representations of input until time t. This *ct* is then used to compute mutual information loss with future input representations *zt+k*. Also rather than doing this directly a form of negative sampling is used where a bag of future representations is taken *{zt+k, zj1 ... zjN-1}* with only one positive sample and N-1 'negative samples' 23 | 24 | - Following this idea, the authors suggest **Greedy InfoMax** which is used to greedily train separate modules in the network. So to do so, first representations are extracted from *M-1* module to be passed onto *M* module, so ztM = GradientBlock(encoding(xtM-1)). The GradientBlock helps for the gradient to not pass backward. 25 | 26 | - So to train each module separately, the strategy followed in CPC is directly taken up with one modification of rather not choosing autoregressive model to summarize information till time t, rather the encoding at time t is directly used to calculate mutual information between temporally nearby patches which was found as good during implementation. 27 | 28 | ## Implementation Details 29 | 30 | - The InfoNCE objective works the better as the number of negative samples are increased for its training. 31 | - The autoregressive model has been used for greedy infomax in certain categories of data like speech recognition where a broad spectrum of information is required to calculate a reliable mutual information. 
32 | - The autoregressive models that are being used by the authors are GRU or a Pixel-CNN type model. 33 | 34 | ## Our two cents 35 | 36 | - This method owing to its training using modules can certainly lead to training of deeper models both faster and feasible. 37 | - Owing to its local gradients, the problem of vanishing gradients is certainly reduced. 38 | - While not similar the idea proposed in the Synthetic Gradients paper by Deepmind certainly lies somewhere near this one. Check it out [here](https://deepmind.com/blog/article/decoupled-neural-networks-using-synthetic-gradients) 39 | - It takes some serious inspiration from neuroscience considering its fuctioning comes pretty close to that of brain which does process its perceptions by maximally preserving the information of the input activities in each layer. 40 | - However, the main objective of the paper is maximally inspired from the paper by [Oord et al](https://arxiv.org/abs/1807.03748). 41 | 42 | 43 | ## Implementation 44 | 45 | - [https://github.com/loeweX/Greedy_InfoMax](https://github.com/loeweX/Greedy_InfoMax) 46 | - A [blog](https://yann-leguilly.gitlab.io/post/2019-09-29-representation-learning-with-contrastive-predictive-coding/) to read more about InfoNCE and Contrastive Predictive Coding. -------------------------------------------------------------------------------- /summaries/knowledge_distillation_with_model_calibration.md: -------------------------------------------------------------------------------- 1 | # Rethinking the Knowledge Distillation From the Perspective of Model Calibration 2 | Albert Gu, Karan Goel, Christopher Ré, **ICLR** **2022** 3 | 4 | 5 | ## Summary 6 | 7 | This paper builds up on the idea that more accurate teachers are not better teachers due to a mismatch in abilities. They go on to analyze these ideas through experiments on toy datasets. They confirm this observation and propose a simple calibration technique to address this. Results follow their hypothesis. 8 | 9 | 10 | ## Contributions 11 | 12 | - They found that the larger teacher model may be too over-confident, so the student model cannot effectively imitate it. 13 | - While, after simple model calibration of the teacher model, the size of the teacher model has a positive correlation with the performance of the student model 14 | 15 | 16 | ## Method 17 | 18 | 19 | 20 | - To analyze the effects of Model calibration, they take student and teacher as CNN models `Resnet.` 21 | - To address the over-confident issue, they minimize the negative log-likelihood (NLL) to train the temperature parameter $T$ , and use the temperature scale method to perform simple post-processing calibration on the model 22 | 23 | $$ \begin{aligned} 24 | \mathcal{L} &=-\sum_{i=1}^{n} \log \left(\hat{\pi}\left(y_{i} \mid \mathbf{x}_{i}\right)\right) \\ 25 | \hat{q}_{i} &=\max _{k} \sigma_{\mathrm{SM}}\left(\mathbf{z}_{i} / T\right)^{(k)} 26 | \end{aligned} $$ 27 | - $q_i$ is the calibrated confidance 28 | - They are simply changing the original knowledge temprature $t$ to some optimal learned temprature $T$ 29 | 30 | 31 | ## Results 32 | 33 | - Done on `CIFAR-10` and `Fashion-MNIST` 34 | - They use ECE `Expected Calibration Error` as the matrix. ECE values can be used to calibrate (adjust) a neural network model so that output pseudo-probabilities more closely match the actual probabilities of a correct prediction 35 | 36 | 37 | 38 | - Results confirm that ECE and NLL go down after calibration. 
These improvements are also reflected in the accuracy of the student model.

## Two Cents

- This paper confirms the observation that a bigger teacher model doesn't directly imply better student training.
- Simply changing the original knowledge-distillation temperature to a learned temperature shows improvement.
- It is observed that model depth and width, weight decay, and batch normalization are also essential factors in model calibration; this paper doesn't take these into account, so there is scope for further work here.

## Resources

- [Original Paper](https://arxiv.org/pdf/2111.01684.pdf)

-------------------------------------------------------------------------------- /summaries/machine_unlearning.md: --------------------------------------------------------------------------------

# Machine Unlearning
Lucas Bourtoule, Varun Chandrasekaran, Christopher A. Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, Nicolas Papernot, **IEEE** **2020**

## Summary

The paper provides a new way of training ML models named SISA (Sharded, Isolated, Sliced, and Aggregated) training. It formalises the definition of unlearning, is primarily motivated by privacy concerns, and also discusses some straw-man solutions. The algorithm is applicable to a wide variety of models, ranging from simple linear regression to complicated deep neural networks (DNNs) — essentially any model that learns iteratively via gradient descent; it does not apply to models trained with other criteria, such as decision trees grown with the Gini criterion. The paper primarily focuses on DNNs and reports good results for them compared to classic ML models. It also compares its results with the naive approach of retraining from scratch, and highlights the lack of work in this field along with our limited knowledge of how a single datapoint influences model parameters.

## Contributions

The main contribution of the paper is that it provides simple strategies like SISA training which empower users to have their data completely removed from a model.

The paper is primarily motivated by privacy and hopes to spur follow-up work on effective ways of machine unlearning.

## Working

In case of no knowledge regarding the distribution of unlearning requests (i.e. a uniform distribution):

1. Split the dataset into several parts known as shards. Further divide each shard into slices, which are used for incremental training of the individual models with stochastic gradient descent.

2. Train one individual model per shard and give the final prediction by aggregating the predictions made by the individual models.

3. While training the model on a shard, save the model parameters after training on the first slice, then train it on the second slice, and so on.

4. At the time of an unlearning request, first locate the shard and slice in which the datapoint to be erased is present.

5. Delete the datapoint and retrain on the affected slices (after deletion of the datapoint), starting from the model parameters saved just before the slice that contained it.

In case of prior knowledge regarding the distribution of unlearning requests, the shards are made accordingly (details are given in the paper); all other steps stay the same. A minimal sketch of this procedure is given below.
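The sketch below is purely illustrative (the helper names and a scikit-learn-style model with an incremental `fit` are our own assumptions, not the authors' code): it trains one shard with per-slice checkpoints and services a single unlearning request by restarting from the checkpoint saved before the affected slice.

```python
import copy
import numpy as np

def train_shard(model_fn, shard_data, n_slices):
    """Train one constituent model incrementally over slices, saving a
    checkpoint of its parameters after each slice."""
    slices = np.array_split(np.arange(len(shard_data)), n_slices)
    model, checkpoints, seen = model_fn(), [], []
    for sl in slices:
        seen.extend(int(i) for i in sl)
        X = [shard_data[i][0] for i in seen]
        y = [shard_data[i][1] for i in seen]
        model.fit(X, y)                           # incremental update on the data seen so far
        checkpoints.append(copy.deepcopy(model))  # parameters after this slice
    return model, slices, checkpoints

def unlearn(model_fn, shard_data, slices, checkpoints, remove_idx):
    """Handle an unlearning request for one point of this shard: drop the point,
    restore the checkpoint saved just before its slice, and retrain from there."""
    hit = next(s for s, sl in enumerate(slices) if remove_idx in sl)
    model = copy.deepcopy(checkpoints[hit - 1]) if hit > 0 else model_fn()
    seen = [int(i) for sl in slices[:hit] for i in sl]
    for sl in slices[hit:]:
        seen.extend(int(i) for i in sl if i != remove_idx)
        X = [shard_data[i][0] for i in seen]
        y = [shard_data[i][1] for i in seen]
        model.fit(X, y)
    return model
```

At inference time, the per-shard models are aggregated, e.g. by a majority vote over their individual predictions.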
## Results

| Dataset | Dimensionality | Size | Number of Classes | Model Architecture |
| ------ | ------ | ------ | ------ | ------ |
| MNIST | 28 × 28 | 60000 | 10 | 2 conv. layers followed by 2 FC layers |
| Purchase | 600 | 250000 | 2 | 2 FC layers |
| SVHN | 32 × 32 × 3 | 604833 | 10 | Wide ResNet-1-1 |
| CIFAR-100 | 32 × 32 × 3 | 60000 | 100 | ResNet-50 |
| Mini-Imagenet | 224 × 224 × 3 | 128545 | 100 | ResNet-50 |
| Imagenet | 224 × 224 × 3 | 1281167 | 1000 | ResNet-50 |

With the above setup, the following results were obtained:

1. As the number of shards increased, unlearning time decreased, but the aggregate accuracy of the model also decreased, since the individual models had to be trained on ever smaller datasets, turning them into weak learners.
2. Increasing the number of slices doesn't have any effect on accuracy as long as the number of epochs is recalibrated accordingly.
3. Above a threshold number of unlearning requests, the accuracy of SISA learning drops below the naive approach of retraining the model from scratch.
4. Below that threshold, SISA learning beats the naive approach in both accuracy and unlearning time.

## Two-cents

The paper is written in a very simple and easy-to-understand language and provides a clear algorithm for performing SISA learning. It reports detailed results, including those where SISA falls short of simply retraining the model from scratch.

The paper also considers the realistic scenario where we have some idea of the distribution of unlearning requests, which could be used as a prior (in a Bayesian sense) to split the dataset into shards and slices accordingly.

## Resources

- Paper: https://arxiv.org/abs/1912.03817

-------------------------------------------------------------------------------- /summaries/siamese.md: --------------------------------------------------------------------------------

# Siamese Recurrent Architectures for Learning Sentence Similarity

Jonas Mueller, Aditya Thyagarajan, AAAI-2016

## Summary

This paper proposes a siamese adaptation of the Long Short-Term Memory \[LSTM\] network for labeled data comprised of pairs of variable-length sequences. The proposed model is applied to assess semantic similarity between sentences, improving over the state of the art, outperforming carefully handcrafted features and recently proposed neural network systems of greater complexity.

**Main contributions**:

- A simple adaptation of the LSTM has been trained on paired examples to learn a highly structured space of sentence-representations that captures rich semantics.

- The model uses LSTMs to read in word-vectors representing each input sentence and employs its final hidden state as a vector representation for each sentence.
This vector representation is used to calculate the similarity between the input sentences.![](https://lh4.googleusercontent.com/k4yl5iF05iwpY4PKwdgt3k9Xq3Uoh9FOPfkqQsyfqO8ebWM935__OpOCBfqDXCt__AKqDX-XAPzLIBMzeJQRMbJB2maYDEpERQfemcLJS8XALxI4LRa2QGdTmhWFZHmwWDiQQqd4) 15 | 16 | 17 | 18 | 19 | 20 | - Unlike previous works on sentence similarity, which employ manual feature generation, substantial architectural engineering in their models or complex learning algorithms on the representations, the model proposed here consists of just two simple LSTM networks with tied weights. Also, a simple similarity function is used which forces the LSTM to entirely capture the semantic differences during training. 21 | 22 | - **L1 vs L2 norm**: Using a L2 rather than L1 norm in the similarity function can lead to undesirable plateaus in the overall objective function 23 | 24 | 25 | - **Data augmentation**: To ensure invariance to precise wording, and generate more training examples, 10,022 additional training examples are generated by replacing random words with one of their synonyms found in Wordnet \[Miller 1995\]. 26 | 27 | 28 | - **Pre-training**: The MaLSTM is pre-trained on separate sentence-pair data provided for the earlier SemEval 2013 Semantic Textual Similarity task \[Agirre and Cer 2013\]. The weights resulting from this pre-training thus form our starting point for the SICK data 29 | 30 | 31 | ## Strengths 32 | 33 | - The model is trained on the explicit task of semantic similarity, hence semantic structure in the resulting representation is immediately evident, unlike previous works which rely on complex operations over their learned representation. 34 | 35 | - Even though the representations have been learned for sentence similarity, they perform almost as good as previously proposed methods for "Entailment classification" task. This shows that the representations are able to store the general semantics of the sentences. 36 | 37 | 38 | ## Weaknesses / Notes 39 | 40 | - Due to limited training data, performance gains by switching to multi-layer or bidirectional LSTMs have not been evaluated. 41 | 42 | 43 | 44 | 45 | 46 | ## Implementation 47 | 48 | [https://github.com/dhwajraj/deep-siamese-text-similarity/](https://github.com/dhwajraj/deep-siamese-text-similarity/) 49 | [https://github.com/aditya1503/Siamese-LSTM](https://github.com/aditya1503/Siamese-LSTM) -------------------------------------------------------------------------------- /summaries/style_GAN.md: -------------------------------------------------------------------------------- 1 | # A Style-Based Generator Architecture for Generative Adversarial Networks 2 | 3 | - Authors:- Tero Karras, Samuli Laine, Timo Aila 4 | - Conference name:- The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 5 | - Year of publication:- 2018 6 | 7 | ## Summary 8 | 9 | This paper proposes a generative architecture, which is an enhanced version of proGAN, which helps in unsupervised separation of the high-level attributes(like pose and identity in humans) and of stochastic variation in the generated images(like hairs, etc.). The proposed method generates images of very high quality(like 1024x1024) and also improves the disentanglement of the latent vectors of variation. 10 | 11 | ## Main contributions 12 | 13 | - Replace the nearest-neighbour up/downsampling in both networks with bilinear sampling. 
14 | 15 | - Instead of giving input as a latent vector, a learned constant input is given and at each convolution layer, a latent vector(w) is given to adjust the styles of the image. 16 | 17 | - The latent vector(w) is obtained by a non-linear mapping from a random generated latent vector(z) using 8 layer MLP. 18 | w is then passed to affine transformation which helps to control adaptive instance normalization (AdaIN) operations after each convolution layer. The AdaIn replaces the pixel norm that was used in the original proGAN architecture. 19 | 20 | - Gaussian noise is injected after every convolution layer, which helps in the unsupervised separation of the high-level attributes and the stochastic variation. 21 | 22 | - Use of mixing regularization(using two latent vectoers instead of one) for increasing the localisation of the styles , which increases the FID score. 23 | 24 | - Proposal of two new metrics for the evaluating disentanglement, which do not require an encoder network :-> perceptual path-length and linear separability 25 | 26 | - A new dataset of human faces(Flickr-Faces-HQ, FFHQ) with much higher quality and diversity is presented. 27 | 28 | ## Generator network architecture 29 | 30 | ![Generator network architecure](https://miro.medium.com/v2/resize:fit:1400/1*_VUXFwZGAYke4_GFqCgqlQ.png) 31 | 32 | ## Implementation details 33 | 34 | - The generator architecture is modified from the generator architecture of the original [proGAN](https://arxiv.org/pdf/1710.10196v3.pdf) paper. 35 | 36 | ## Two-cent 37 | 38 | - The main idea of the proposed paper is to get a better control over the image synthesis process. 39 | 40 | - The mapping network and the affine tramsformations draw samples for each style from a learned distribution, while the synthesis network generates images based on collection of those styles. 41 | 42 | - There is no modification in the discriminator architecture and the loss functions of proGAN, the major change is in the generator architecture. 43 | 44 | - The collection of styles(in human faces) is like 45 | - Coarse styles(4x4 - 8x8) :-> pose, hair, face shape 46 | - Middle styles(16x16 - 32x32) :-> facial features, eyes 47 | - Fine styles(64x64 - 1024x1024) :-> color scheme 48 | 49 | ## Resources 50 | 51 | - Paper :- https://arxiv.org/pdf/1812.04948.pdf 52 | 53 | - Demonstrative Video :- https://youtu.be/kSLJriaOumA 54 | 55 | - Implementation :- https://github.com/NVlabs/stylegan 56 | -------------------------------------------------------------------------------- /summaries/this_looks_like_that.md: -------------------------------------------------------------------------------- 1 | # This Looks Like That: Deep Learning for Interpretable Image Recognition 2 | Chaofan Chen, Oscar Li, Chaofan Tao, Alina Jade Barnett, Jonathan Su, Cynthia Rudin 3 | 4 | ## Summary 5 | This paper proposes a novel idea for interpretable deep learning , it basically figures out some protopyical parts of images by itself , and then uses these prototypes to make classification , hence making the classification process interpretable. 6 | **Among the top 3% accepted papers of NIPS 2019** 7 | 8 | ## Architecture Details 9 | - Intially we have the first 13 layers of the VGG-16 network ,followed by 2 , 1x1 convs 10 | - The network learns m **prototypes** (pj) , with a predefined number of prototypes for each class . 
11 | - Each prototpye is applied to all the patches of the conv output , and using the L2 distance , a **similarity score is produced for a single prototype for all the patches** , this can also be used to make a heatmap of similarity . 12 | - Then global pooling is applied to convert this into a single score **gpj**.Which represents the strongest similar finding of that prototype in the image. 13 | - Then the m global similarity scores are feeded in the FC layer to perform classification. 14 | 15 | 16 | ## Overview 17 | - The network basically learns from the training set, a limited number of **prototypical parts** that are useful in classifying a new image. 18 | - The model is able to identify several parts of the image where it thinks that this identified part of the image looks like that prototypical part of some training image, and makes its prediction based on a weighted combination of the **similarity scores** between parts of the image and the learned prototypes. 19 | 20 | 21 | 22 | - There are 3 training stages 23 | 1. **SGD of the layers before the FC layer h** - This aims to learn a meaningful latent space where the most important patches for classifying images are clustered around prototypes associated with their own classes (Clustering Cost), and those important patches from different classes will be separated into distinct clusters(Separation Cost) . The loss function is sum of the cross entropy loss , the clustering cost and the seperation cost . 24 | 25 | 2. **Projection of the prototypes onto the closest latent representations of training image patches from the same class as that of the prototype.** - We push each prototype pj onto the closest latent representation of training image patches from the same class as that of pj . So effectively each prototype now corresponds to some patch of a image in the training set 26 | 3. **Convex optimization of the last layer h. (Using Adam)** - The parametets of the conv , and the prototype layers are fixed . Also the weights of the last layer are encouraged to be sparse by adding L1 regularisation. 27 | 28 | 29 | 30 | 31 | - The nearest prototypes of a given image are mostly prototypes associated with the class of the image 32 | - The nearest patches of a given prototype mostly come from those images in the same class as that of the prototype 33 | - Similar parts are consistently highlighted when the original images containing the nearest patches of a given prototype are passed through the network to generate the activation maps. 34 | 35 | ## Strengths 36 | - The classification process is very interpretable , as we can see which prototype got the maximum similarity and also we can make an activation map to identify which part of the image was repsonsible for classification. 37 | - In more high stakes decisions like detecting breast cancer this interpretability is very important , as we need to know examine the process by which the network classifies. 38 | - The network can achieve comparable accuracy with its analogous standard non-interpretable counterpart as well as other interpretable deep models. 39 | 40 | ## Weaknesses 41 | - There are sometimes same prototypes , which indicates a lack of representational power. 42 | - Sometimes similar prototypes of other classes also have high similarity scores , leading to wrong classification. 
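A rough PyTorch sketch of the prototype-to-patch similarity computation described in the architecture section (our own code, not the authors' release; the log((d²+1)/(d²+ε)) similarity follows the original paper's formulation, and the shapes are illustrative):

```python
import torch

def prototype_scores(conv_feats, prototypes, eps=1e-4):
    """conv_feats: (B, C, H, W) conv output; prototypes: (m, C, 1, 1).
    Returns one global similarity score per prototype, shape (B, m)."""
    B, C, H, W = conv_feats.shape
    patches = conv_feats.permute(0, 2, 3, 1).reshape(B, H * W, C)  # every spatial patch
    protos = prototypes.reshape(1, 1, -1, C)
    d2 = ((patches.unsqueeze(2) - protos) ** 2).sum(-1)            # squared L2 distances, (B, HW, m)
    sim = torch.log((d2 + 1) / (d2 + eps))                         # small distance -> large similarity
    return sim.max(dim=1).values                                   # global max pool over patches
```

These m global scores are what the final (sparse) fully connected layer consumes, and the per-patch similarity map before the max pool is what produces the heatmaps mentioned above.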
## Implementation
- Work in progress (as a part of the NeurIPS 2019 Reproducibility Challenge)

-------------------------------------------------------------------------------- /summaries/uniform_convergence.md: --------------------------------------------------------------------------------

# Uniform convergence may be unable to explain generalization in deep learning
Vaishnavh Nagarajan, J. Zico Kolter

## Summary
This paper argues against the use of uniform convergence based generalization bounds to explain why overparameterized deep networks generalize well. It does so by showing that even the tightest (algorithm, distribution)-dependent uniform convergence bound fails on a simple learning task.

## Background Information about the problem

- The authors first review how the standard uniform convergence based bounds work.
- Classical uniform convergence based approaches bound the test error as `test error <= train error + O(complexity term) / sqrt(training set size)`, which fails in the overparameterized setting because the complexity term blows up.
- The modern approach is to find weights which are implicitly regularised when trained on real data and then apply uniform convergence to the resulting simpler, norm-bounded class of functions to yield better bounds.

## Main Experiment

- They took two uniform hypersphere distributions in 1000 dimensions which are very close to each other.
- The task was to separate these two hyperspheres; they trained a wide 1-hidden-layer neural network using SGD on this dataset.
- It was observed that as the training dataset size increases, the test error decreases, i.e. generalization improves.
- But now comes the main catch: they project each training point onto the opposite hypersphere and flip its label, then try to classify these points — and the network misclassifies all of these projected data points.
- So the error on this new dataset is high, even though the error on random test inputs is small.
- Intuitively, the misclassification of this new data means that the learned decision boundary near each training datapoint is skewed into the opposite class, resulting in the closest point from the opposite class being misclassified.
- The skewing around datapoints also means that the learned decision boundary is "complex" enough to "memorize" the locations of the training data (although at the same time, the decision boundary generalizes well). A toy sketch of this experiment is given below.
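The following toy PyTorch sketch samples the two spheres, trains a wide one-hidden-layer network, and then evaluates it on the projected, label-flipped training points; all hyperparameters here (radii, width, learning rate, step count) are our own illustrative choices rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, n, r1, r2 = 1000, 4096, 1.0, 1.1            # input dimension, sample count, the two radii

def sample_sphere(m, radius):
    x = torch.randn(m, d)
    return radius * x / x.norm(dim=1, keepdim=True)   # uniform on the sphere of that radius

X = torch.cat([sample_sphere(n // 2, r1), sample_sphere(n // 2, r2)])
y = torch.cat([torch.zeros(n // 2), torch.ones(n // 2)])

net = nn.Sequential(nn.Linear(d, 2048), nn.ReLU(), nn.Linear(2048, 1))
opt = torch.optim.SGD(net.parameters(), lr=0.5)
for _ in range(2000):                           # plain full-batch SGD on the logistic loss
    opt.zero_grad()
    F.binary_cross_entropy_with_logits(net(X).squeeze(1), y).backward()
    opt.step()

# project every training point onto the *other* sphere and flip its label
scale = torch.where(y.bool(), torch.tensor(r1 / r2), torch.tensor(r2 / r1)).unsqueeze(1)
X_proj, y_flip = X * scale, 1 - y
err = ((net(X_proj).squeeze(1) > 0).float() != y_flip).float().mean()
print(f"error on projected training points: {err:.2f}")  # high, despite low error on fresh test data
```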

25 | 26 |

27 | 28 | ## Main Contributions 29 | 30 | - The author suggested that the modern approaches for bounds either still depend on the parameter count to a large extent or if not, they hold only on a modified network, which has gone through different kinds of explicit post-processing. 31 | 32 | - They observe that as we increase the dataset size, the test error decreases with the training set size, as expected. On the other hand, the bounds increase with dataset size, this increase even though the denominator in these generalization bounds already has a training set size is because of the weight norms of the network increase significantly with the dataset size. 33 | 34 | 35 | - In other words, they emphasize that we must also worry about training-data-size dependence of these generalization bounds to fully explain generalization in deep learning and not only focus on the parameter count dependence. 36 | 37 | - They also supposedly disprove uniform convergence in its best-case scenario ...ie.....via the hemisphere experiment, they give multiple mathematical theorems and proofs for it. 38 | 39 | - So the main takeaway here is that overparameterized models trained by gradient descent can have certain complexities in their boundaries, and these complexities can hurt uniform convergence, without hurting generalization 40 | 41 | ## Our Two Cents 42 | 43 | - The experiment proposed in this paper is very simple but also quite thought-provoking ....a rare feature in the deep learning world. 44 | - This paper proposes the need to think about new methods to explain generalization in deep neural networks and thus is highly research encouraging. 45 | 46 | 47 | ## References and Further reads 48 | - https://locuslab.github.io/2019-07-09-uniform-convergence 49 | - http://www.cs.cmu.edu/~vaishnan/talks/icml19_ws_uc_slides.pdf 50 | -------------------------------------------------------------------------------- /summaries/vgraph.md: -------------------------------------------------------------------------------- 1 | 2 | # vGraph: A Generative Model for Joint Community Detection and Node Representational Learning 3 | 4 | Fan-Yun Sun, Meng Qu, Jordan Hoffmann, Chin-Wei Huang, Jian Tang, NIPS-2019 5 | 6 | ## Summary 7 | 8 | This paper proposes a novel technique for learning node representations and at the same time perform community detection task for the graphical data by creating a generative model using the variational inference concepts. 9 | 10 | ## Main Contributions 11 | 12 | - The authors propose a generative model for jointly learning community membership and node representations, the two tasks even though being highly correlated, studied mostly independently in the previous literature. 13 | 14 | - For achieving this, it is assumed that each node can be represented as a mixture of communities, and each community is defined as a multinomial distribution over nodes. Hence we leverage node as well as community embeddings to generate an edge given a node. 15 | 16 | 17 | 18 | - Given a edge between nodes u & v, to obtain a generative model between them, firstly using the fact that each node can be represented a mixture of communities, we find *p(z|u)*. Once the community assignment(z) is done and since each community is assumed as a multinomial distribution over nodes, we find the node mostly likely on the other side of the edge determined using the community assignment *p(v|z)*. 
19 | 20 | - So the generation process can be formulated probabilistically as **p(c|w) = Σ_z p(c|z) p(z|w)**, where *p(z|w)* is the community assignment, computed as a softmax over the possible community assignments obtained by multiplying the node embedding with the community embeddings, and *p(c|z)* is the node assignment, computed as a softmax over all possible nodes obtained by multiplying the community embedding with the destination-node embeddings. 21 | 22 | - As a result, we want to maximize the likelihood **p(c|w)**; however, since this is intractable, variational inference is used and the **ELBO** is maximized as the training objective instead. 23 | 24 | - To do this, just as in standard VAEs, an approximation *q(z|c,w)* of the true posterior *p(z|c,w)* is learned, which is then used to reconstruct the node via *p(c|z)*. The ELBO therefore consists of a cross-entropy (reconstruction) loss between the reconstructed node (using w) and the original node *c*, and a KL-divergence term between *q(z|c,w)* and *p(z|w)* (i.e. between the approximate posterior and the prior). 25 | 26 | - To reconstruct the node from *q(z|c,w)*, a reparametrization trick is needed to keep the model end-to-end differentiable; however, unlike the continuous latent variables of standard VAEs, the community assignment here is discrete. The **Gumbel reparametrization trick** is therefore used to obtain a (relaxed) discrete assignment while maintaining gradient flow through the system (a minimal sketch of one training step is given after the Our Two Cents section below). 27 | 28 | - Also, to incorporate the tendency of nearby nodes to belong to similar communities, a regularization term is added that encourages the distributions *p(z|w)* and *p(z|c)* to be close. 29 | 30 | ## Implementation Details 31 | 32 | - While calculating *p(c|z)*, the softmax is taken over all nodes, which does not scale to larger graphs; the **negative sampling technique** can be used for more efficient computation in such cases. However, the use of negative sampling also leads to slower optimization. 33 | 34 | - The Gumbel reparametrization technique yields an end-to-end model even while using discrete community assignments. More information about the technique can be found [here](https://casmls.github.io/general/2017/02/01/GumbelSoftmax.html). 35 | 36 | ## Our Two Cents 37 | 38 | - This novel technique for tackling the joint task with a generative model produces SOTA results on various datasets, beating **ComE**, the other technique that also performs both tasks at the same time. The use of negative sampling also makes the method scalable to large datasets. 39 | 40 | - While the generative model leverages node closeness and neighbourhood structure (through the edge information) to obtain correct community detection, it still does not leverage node features, which could be improved upon in further work.
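As referenced above, here is a minimal PyTorch-style sketch of one vGraph training step, tying together *p(z|w)*, *p(c|z)*, the Gumbel-softmax sample, and the ELBO. The specific parameterizations (shared node embeddings, the exact form of *q(z|c,w)*) and all sizes are simplifying assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

num_nodes, num_comms, dim = 1000, 16, 128                       # illustrative sizes
phi = torch.nn.Parameter(0.01 * torch.randn(num_nodes, dim))    # node embeddings
psi = torch.nn.Parameter(0.01 * torch.randn(num_comms, dim))    # community embeddings
opt = torch.optim.Adam([phi, psi], lr=1e-2)

def train_step(w, c, tau=1.0):
    """One ELBO step for a batch of edges (w, c) given as node index tensors."""
    prior_logits = phi[w] @ psi.t()              # p(z|w): softmax over communities
    post_logits = (phi[w] * phi[c]) @ psi.t()    # q(z|c,w): assumed parameterization
    z = F.gumbel_softmax(post_logits, tau=tau)   # relaxed discrete community sample
    recon_logits = z @ psi @ phi.t()             # p(c|z): softmax over all nodes
    recon = F.cross_entropy(recon_logits, c)     # reconstruction term of the ELBO
    q, log_q = F.softmax(post_logits, -1), F.log_softmax(post_logits, -1)
    kl = (q * (log_q - F.log_softmax(prior_logits, -1))).sum(-1).mean()
    loss = recon + kl
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Usage: iterate over batches of (source, destination) node indices drawn from the edge list.
w = torch.randint(0, num_nodes, (256,))
c = torch.randint(0, num_nodes, (256,))
print(train_step(w, c))
```

For large graphs, the full cross-entropy over all nodes would be replaced by negative sampling, as noted in the Implementation Details above.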
41 | 42 | ## Implementation 43 | 44 | - [https://github.com/fanyun-sun/vGraph](https://github.com/fanyun-sun/vGraph) 45 | - [https://github.com/aniket-agarwal1999/vGraph-Pytorch](https://github.com/aniket-agarwal1999/vGraph-Pytorch) -------------------------------------------------------------------------------- /summaries/vision_attention.md: -------------------------------------------------------------------------------- 1 | # Stand-Alone Self-Attention in Vision Models 2 | 3 | Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, Jonathon Shlens 4 | 5 | ## Summary 6 | 7 | This paper proposes a method to entirely replace convolutions with purely self-attention based vision models. 8 | 9 | ## Main contribution 10 | 11 | - CNNs have trouble capturing long-range interactions because they do not scale well to large receptive fields. 12 | 13 | - Attention usually allows us to deal with such long-range interactions; this paper provides a way to build a purely self-attention based vision model. 14 | 15 | - Previous works used global attention, i.e. attention between all the pixels, which required considerable downsampling due to the computational overhead. This paper uses local attention, i.e. attention applied to a small neighbourhood around each pixel. 16 | 17 | - Instead of the sinusoidal positional embeddings of the Transformer, they use relative positional embeddings ("relative attention"), which gives translation equivariance, the same property convolutions have. 18 | 19 | - The paper suggests that it is better to use convolutions in the earlier layers, which they call the `stem`, since in the early layers the content is comprised of RGB pixels that are individually uninformative and heavily spatially correlated. This property makes learning useful features such as edge detectors difficult for content-based mechanisms such as self-attention. 20 | 21 | - Convolutions may better capture low-level features, while attention layers better capture global information. 22 | 23 | 24 | 25 | ## Methodology 26 | 27 | - Self-attention here is similar to the Transformer's, consisting of query, key and value projections. The query of a pixel is dot-producted with the keys of all local pixels, a softmax is applied to get the weights, and these weights are used to form a weighted sum of the values to get the output. The input pixel feature of dimension `din` is transformed to `dout` for each of query, key and value, using separate linear transformations. 28 | 29 | - They also perform multi-head attention by dividing the input pixel features of size `din` into equal parts depthwise (`din/n`), applying attention to each part individually (producing `dout/n` each), and concatenating them to get the output of size `dout`. 30 | 31 | 32 | - The relative attention is applied by assigning each local pixel a row-offset and a column-offset embedding, each of size `dout/2`, based on its relative row/column distance from the query pixel. These two embeddings are concatenated and used along with the keys in the attention-weight calculation (see the sketch at the end of this summary). 33 | 34 | - For replacing convolutions, 1x1 convolutions are left as they are, since they are simply equivalent to a per-pixel fully connected layer. All other convolutions are replaced by attention layers, and 2x2 average pooling with stride 2 is used wherever downsampling is required. 35 | 36 | - Convolutions have a distance-based weight parametrization, i.e. a spatial structure in their filters/weights that lets them easily learn features such as edge detectors. To introduce this into attention, they also compute the value vector with spatially-varying linear transformations, i.e. the value is calculated as a weighted sum of several linear transformations, with the weights determined by the relative position. This improves performance whilst having similar FLOPS.
37 | 38 | - The best performing models use convolutions in the early groups and attention in the later groups. In contrast, when attention is used in the early groups and convolutions in the later groups, performance degrades. 39 | 40 | ## Strengths 41 | 42 | - The parameter count of attention is independent of the spatial extent (it depends only on `din` and `dout`); this is a significant improvement over convolutions, whose parameter count grows quadratically with the spatial extent. 43 | 44 | - The computational cost of attention 45 | also grows more slowly with the spatial extent compared to convolution; a convolution layer with k = 3 has the same computational cost 46 | as an attention layer with k = 19. 47 | 48 | - The approach currently seems limited by the unavailability of highly optimised attention kernels of the kind that exist for convolutions; with such kernels, the results might improve further. 49 | 50 | ## Weakness 51 | 52 | - The implementation of the relative positional embeddings seems lacking in detail. 53 | 54 | - The calculations comparing the computational/parametric cost of attention vs. convolution are simply stated and not explained. 55 |
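As promised above, here is a compact PyTorch-style sketch of a single-head local self-attention layer with the relative row/column embeddings described in the Methodology section. The window size, initialisation, and the omission of multi-head splitting and of the spatially-varying value transformation are simplifications for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalSelfAttention2d(nn.Module):
    """Single-head local self-attention over a k x k neighbourhood per pixel."""
    def __init__(self, d_in, d_out, k=7):
        super().__init__()
        self.k, self.d_out = k, d_out
        self.to_q = nn.Conv2d(d_in, d_out, 1)   # 1x1 convs act as per-pixel linear maps
        self.to_k = nn.Conv2d(d_in, d_out, 1)
        self.to_v = nn.Conv2d(d_in, d_out, 1)
        # Learned relative row-offset and column-offset embeddings, each of size d_out/2.
        self.rel_rows = nn.Parameter(0.02 * torch.randn(k, d_out // 2))
        self.rel_cols = nn.Parameter(0.02 * torch.randn(k, d_out // 2))

    def forward(self, x):                                    # x: (B, d_in, H, W)
        B, _, H, W = x.shape
        k, d = self.k, self.d_out
        q = self.to_q(x).view(B, d, H * W)                   # query per pixel
        # Gather the k*k keys/values around every pixel (zero padding at the borders).
        keys = F.unfold(self.to_k(x), k, padding=k // 2).view(B, d, k * k, H * W)
        vals = F.unfold(self.to_v(x), k, padding=k // 2).view(B, d, k * k, H * W)
        # Relative embedding for each offset: concatenate row and column parts.
        rel = torch.cat([self.rel_rows[:, None].expand(k, k, d // 2),
                         self.rel_cols[None, :].expand(k, k, d // 2)], -1).reshape(k * k, d)
        logits = torch.einsum('bdp,bdnp->bnp', q, keys)          # content term  q . key
        logits = logits + torch.einsum('bdp,nd->bnp', q, rel)    # relative-position term
        attn = logits.softmax(dim=1)                             # softmax over the window
        out = torch.einsum('bnp,bdnp->bdp', attn, vals)          # weighted sum of values
        return out.view(B, d, H, W)

# Usage: a drop-in replacement for a 3x3 conv inside a ResNet block; downsampling is
# handled separately with 2x2 average pooling, stride 2, as described above.
layer = LocalSelfAttention2d(64, 64, k=7)
y = layer(torch.randn(2, 64, 32, 32))   # -> (2, 64, 32, 32)
```
--------------------------------------------------------------------------------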