├── .gitignore
├── CODE_OF_CONDUCT.md
├── LICENSE
├── README.md
├── SECURITY.md
├── SUPPORT.md
├── auth.env
├── breakfast-dataset
    ├── README.md
    ├── compute_mof_iou_f1.py
    ├── label_data_estimate_baseline_breakfast.json
    └── label_data_gt_breakfast.json
├── docs
    ├── index.html
    └── src
    │   ├── arxiv.png
    │   ├── github-mark.png
    │   ├── pipeline.jpg
    │   ├── qualitative_results.jpg
    │   ├── table.jpg
    │   └── top-level-schema.jpg
├── example.py
├── finegrained-breakfast-dataset
    ├── .gitignore
    ├── README.md
    ├── clip_original_videos.py
    ├── compute_mof_iou_f1.py
    ├── label_data_estimate_baseline.json
    ├── label_data_gt_right.json
    └── original_videos
    │   └── original_videos.txt
├── requirements.txt
├── results
    ├── Grasping_the_can
    │   ├── Grasping_the_can._segment_0.5_1.4.mp4
    │   └── grid_image_sample.png
    ├── Moving_the_can_upwards
    │   ├── Moving_the_can_upwards_segment_2.1_4.9.mp4
    │   └── grid_image_sample.png
    └── Releasing_the_can_placed_on_the_shelf
    │   ├── Releasing_the_can_placed_on_the_shelf_segment_4.5_4.9.mp4
    │   └── grid_image_sample.png
├── sample_video
    └── sample.mp4
├── src
    └── pipeline.jpg
└── thumos14-dataset
    ├── README.md
    └── label_data_estimate_thumos14.txt

/.gitignore:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/.gitignore
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/CODE_OF_CONDUCT.md
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/LICENSE
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/README.md
--------------------------------------------------------------------------------
/SECURITY.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/SECURITY.md
--------------------------------------------------------------------------------
/SUPPORT.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/SUPPORT.md
--------------------------------------------------------------------------------
/auth.env:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/auth.env
--------------------------------------------------------------------------------
/breakfast-dataset/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/breakfast-dataset/README.md
--------------------------------------------------------------------------------
/breakfast-dataset/compute_mof_iou_f1.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/breakfast-dataset/compute_mof_iou_f1.py
--------------------------------------------------------------------------------
/breakfast-dataset/label_data_estimate_baseline_breakfast.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/breakfast-dataset/label_data_estimate_baseline_breakfast.json
--------------------------------------------------------------------------------
/breakfast-dataset/label_data_gt_breakfast.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/breakfast-dataset/label_data_gt_breakfast.json
--------------------------------------------------------------------------------
/docs/index.html:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/docs/index.html
--------------------------------------------------------------------------------
/docs/src/arxiv.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/docs/src/arxiv.png
--------------------------------------------------------------------------------
/docs/src/github-mark.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/docs/src/github-mark.png
--------------------------------------------------------------------------------
/docs/src/pipeline.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/docs/src/pipeline.jpg
--------------------------------------------------------------------------------
/docs/src/qualitative_results.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/docs/src/qualitative_results.jpg
--------------------------------------------------------------------------------
/docs/src/table.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/docs/src/table.jpg
--------------------------------------------------------------------------------
/docs/src/top-level-schema.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/docs/src/top-level-schema.jpg
--------------------------------------------------------------------------------
/example.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/example.py
--------------------------------------------------------------------------------
/finegrained-breakfast-dataset/.gitignore:
--------------------------------------------------------------------------------
1 | out/
--------------------------------------------------------------------------------
/finegrained-breakfast-dataset/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/finegrained-breakfast-dataset/README.md
--------------------------------------------------------------------------------
/finegrained-breakfast-dataset/clip_original_videos.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/finegrained-breakfast-dataset/clip_original_videos.py
--------------------------------------------------------------------------------
/finegrained-breakfast-dataset/compute_mof_iou_f1.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/finegrained-breakfast-dataset/compute_mof_iou_f1.py
--------------------------------------------------------------------------------
/finegrained-breakfast-dataset/label_data_estimate_baseline.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/finegrained-breakfast-dataset/label_data_estimate_baseline.json
--------------------------------------------------------------------------------
/finegrained-breakfast-dataset/label_data_gt_right.json:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/finegrained-breakfast-dataset/label_data_gt_right.json
--------------------------------------------------------------------------------
/finegrained-breakfast-dataset/original_videos/original_videos.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/finegrained-breakfast-dataset/original_videos/original_videos.txt
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | openai
2 | opencv-python
3 |
--------------------------------------------------------------------------------
/results/Grasping_the_can/Grasping_the_can._segment_0.5_1.4.mp4:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/results/Grasping_the_can/Grasping_the_can._segment_0.5_1.4.mp4
--------------------------------------------------------------------------------
/results/Grasping_the_can/grid_image_sample.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/results/Grasping_the_can/grid_image_sample.png
--------------------------------------------------------------------------------
/results/Moving_the_can_upwards/Moving_the_can_upwards_segment_2.1_4.9.mp4:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/results/Moving_the_can_upwards/Moving_the_can_upwards_segment_2.1_4.9.mp4
--------------------------------------------------------------------------------
/results/Moving_the_can_upwards/grid_image_sample.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/results/Moving_the_can_upwards/grid_image_sample.png
--------------------------------------------------------------------------------
/results/Releasing_the_can_placed_on_the_shelf/Releasing_the_can_placed_on_the_shelf_segment_4.5_4.9.mp4:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/results/Releasing_the_can_placed_on_the_shelf/Releasing_the_can_placed_on_the_shelf_segment_4.5_4.9.mp4
--------------------------------------------------------------------------------
/results/Releasing_the_can_placed_on_the_shelf/grid_image_sample.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/results/Releasing_the_can_placed_on_the_shelf/grid_image_sample.png
--------------------------------------------------------------------------------
/sample_video/sample.mp4:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/sample_video/sample.mp4
--------------------------------------------------------------------------------
/src/pipeline.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/src/pipeline.jpg
--------------------------------------------------------------------------------
/thumos14-dataset/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/thumos14-dataset/README.md
--------------------------------------------------------------------------------
/thumos14-dataset/label_data_estimate_thumos14.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/microsoft/VLM-Video-Action-Localization/HEAD/thumos14-dataset/label_data_estimate_thumos14.txt
--------------------------------------------------------------------------------