├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── DATASETS.md ├── DOCUMENTATION.md ├── INFERENCE.md ├── LICENSE ├── README.md ├── common ├── arguments.py ├── camera.py ├── custom_dataset.py ├── generators.py ├── h36m_dataset.py ├── humaneva_dataset.py ├── loss.py ├── mocap_dataset.py ├── model.py ├── quaternion.py ├── skeleton.py ├── utils.py └── visualization.py ├── data ├── ConvertHumanEva.m ├── convert_cdf_to_mat.m ├── data_utils.py ├── prepare_data_2d_custom.py ├── prepare_data_2d_h36m_generic.py ├── prepare_data_2d_h36m_sh.py ├── prepare_data_h36m.py └── prepare_data_humaneva.py ├── images ├── batching.png ├── convolutions_1f_naive.png ├── convolutions_1f_optimized.png ├── convolutions_anim.gif ├── convolutions_causal.png ├── convolutions_normal.png ├── demo_h36m.gif ├── demo_humaneva.gif ├── demo_humaneva_unlabeled.gif ├── demo_temporal.gif └── demo_yt.gif ├── inference ├── infer_video.py └── infer_video_d2.py └── run.py /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Code of Conduct 2 | Facebook has adopted a Code of Conduct that we expect project participants to adhere to. Please [read the full text](https://code.facebook.com/codeofconduct) so that you can understand what actions will and will not be tolerated. -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing 2 | We want to make contributing to this project as easy and transparent as 3 | possible. 4 | 5 | ## Pull Requests 6 | We actively welcome your pull requests. 7 | 8 | 1. Fork the repo and create your branch from `master`. 9 | 2. If you've added code that should be tested, add tests. 10 | 3. If you've changed APIs, update the documentation. 11 | 4. Ensure the test suite passes. 12 | 5. Make sure your code lints. 13 | 6. If you haven't already, complete the Contributor License Agreement ("CLA"). 14 | 15 | ## Contributor License Agreement ("CLA") 16 | In order to accept your pull request, we need you to submit a CLA. You only need 17 | to do this once to work on any of Facebook's open source projects. 18 | 19 | Complete your CLA here: 20 | 21 | ## Issues 22 | We use GitHub issues to track public bugs. Please ensure your description is 23 | clear and has sufficient instructions to be able to reproduce the issue. 24 | 25 | ## Coding Style 26 | We follow the [PEP 8](https://www.python.org/dev/peps/pep-0008/) style guidelines. 27 | 28 | ## License 29 | By contributing to this project, you agree that your contributions will be licensed 30 | under the LICENSE file in the root directory of this source tree. -------------------------------------------------------------------------------- /DATASETS.md: -------------------------------------------------------------------------------- 1 | # Dataset setup 2 | 3 | ## Human3.6M 4 | We provide two ways to set up the Human3.6M dataset on our pipeline. You can either convert the original dataset (recommended) or use the [dataset preprocessed by Martinez et al.](https://github.com/una-dinosauria/3d-pose-baseline) (no longer available as of May 22nd, 2020). The two methods produce the same result. After this step, you should end up with two files in the `data` directory: `data_3d_h36m.npz` for the 3D poses, and `data_2d_h36m_gt.npz` for the ground-truth 2D poses. 5 | 6 | ### Setup from original source (recommended) 7 | **Update:** we have updated the instructions to simplify the procedure. 
MATLAB is no longer required for this step. 8 | 9 | Register to the [Human3.6m website](http://vision.imar.ro/human3.6m/) website (or login if you already have an account) and download the dataset in its original format. You only need to download *Poses -> D3 Positions* for each subject (1, 5, 6, 7, 8, 9, 11) 10 | 11 | ##### Instructions without MATLAB (recommended) 12 | You first need to install `cdflib` Python library via `pip install cdflib`. 13 | 14 | Extract the archives named `Poses_D3_Positions_S*.tgz` (subjects 1, 5, 6, 7, 8, 9, 11) to a common directory. Your directory tree should look like this: 15 | 16 | ``` 17 | /path/to/dataset/S1/MyPoseFeatures/D3_Positions/Directions 1.cdf 18 | /path/to/dataset/S1/MyPoseFeatures/D3_Positions/Directions.cdf 19 | ... 20 | ``` 21 | 22 | Then, run the preprocessing script: 23 | ```sh 24 | cd data 25 | python prepare_data_h36m.py --from-source-cdf /path/to/dataset 26 | cd .. 27 | ``` 28 | 29 | If everything goes well, you are ready to go. 30 | 31 | ##### Instructions with MATLAB (old instructions) 32 | First, we need to convert the 3D poses from `.cdf` to `.mat`, so they can be loaded from Python scripts. To this end, we have provided the MATLAB script `convert_cdf_to_mat.m` in the `data` directory. Extract the archives named `Poses_D3_Positions_S*.tgz` (subjects 1, 5, 6, 7, 8, 9, 11) to a directory named `pose`, and set up your directory tree so that it looks like this: 33 | 34 | ``` 35 | /path/to/dataset/convert_cdf_to_mat.m 36 | /path/to/dataset/pose/S1/MyPoseFeatures/D3_Positions/Directions 1.cdf 37 | /path/to/dataset/pose/S1/MyPoseFeatures/D3_Positions/Directions.cdf 38 | ... 39 | ``` 40 | Then run `convert_cdf_to_mat.m` from MATLAB. 41 | 42 | Finally, run the Python conversion script specifying the dataset path: 43 | ```sh 44 | cd data 45 | python prepare_data_h36m.py --from-source /path/to/dataset/pose 46 | cd .. 47 | ``` 48 | 49 | ### Setup from preprocessed dataset (old instructions) 50 | **Update:** the link to the preprocessed dataset is no longer available; please use the procedure above. These instructions have been kept for backwards compatibility in case you already have a copy of this archive. All procedures produce the same result. 51 | 52 | Download the [~~h36m.zip archive~~](https://www.dropbox.com/s/e35qv3n6zlkouki/h36m.zip) (source: [3D pose baseline repository](https://github.com/una-dinosauria/3d-pose-baseline)) to the `data` directory, and run the conversion script from the same directory. This step does not require any additional dependency. 53 | 54 | ```sh 55 | cd data 56 | wget https://www.dropbox.com/s/e35qv3n6zlkouki/h36m.zip 57 | python prepare_data_h36m.py --from-archive h36m.zip 58 | cd .. 59 | ``` 60 | 61 | ## 2D detections for Human3.6M 62 | We provide support for the following 2D detections: 63 | 64 | - `gt`: ground-truth 2D poses, extracted through the camera projection parameters. 65 | - `sh_pt_mpii`: Stacked Hourglass detections (model pretrained on MPII, no fine tuning). 66 | - `sh_ft_h36m`: Stacked Hourglass detections, fine-tuned on Human3.6M. 67 | - `detectron_pt_h36m`: Detectron (Mask R-CNN) detections (model pretrained on COCO, no fine tuning). 68 | - `detectron_ft_h36m`: Detectron (Mask R-CNN) detections, fine-tuned on Human3.6M. 69 | - `cpn_ft_h36m_dbb`: Cascaded Pyramid Network detections, fine-tuned on Human3.6M. Bounding boxes from `detectron_ft_h36m`. 70 | - User-supplied (see below). 
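All of these detection sources share the same archive format. As a quick sanity check, the snippet below shows one way to inspect such an archive -- a minimal sketch, assuming NumPy is installed and that an archive such as `data_2d_h36m_cpn_ft_h36m_dbb.npz` is already present in `data/`; the `positions_2d`/`metadata` key names follow the format written by the preprocessing scripts in `data/` and should be treated as an assumption here.

```python
# Sketch: inspect a 2D detection archive (key names assumed to match the
# format produced by the preprocessing scripts in data/).
import numpy as np

archive = np.load('data_2d_h36m_cpn_ft_h36m_dbb.npz', allow_pickle=True)
metadata = archive['metadata'].item()        # e.g. number of joints, left/right symmetry
keypoints = archive['positions_2d'].item()   # dict: subject -> action -> one array per camera

print(metadata)
for subject, actions in keypoints.items():
    for action, cameras in actions.items():
        # each camera entry is an array of shape (num_frames, num_joints, 2)
        print(subject, action, [cam.shape for cam in cameras])
```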
71 | 72 | The 2D detection source is specified through the `--keypoints` parameter, which loads the file `data_2d_DATASET_DETECTION.npz` from the `data` directory, where `DATASET` is the dataset name (e.g. `h36m`) and `DETECTION` is the 2D detection source (e.g. `sh_pt_mpii`). Since all the files are encoded according to the same format, it is trivial to create a custom set of 2D detections. 73 | 74 | Ground-truth poses (`gt`) have already been extracted by the previous step. The other detections must be downloaded manually (see instructions below). You only need to download the detections you want to use. For reference, our best results on Human3.6M are achieved by `cpn_ft_h36m_dbb`. 75 | 76 | ### Mask R-CNN and CPN detections 77 | You can download these directly and put them in the `data` directory. We recommend starting with: 78 | 79 | ```sh 80 | cd data 81 | wget https://dl.fbaipublicfiles.com/video-pose-3d/data_2d_h36m_cpn_ft_h36m_dbb.npz 82 | wget https://dl.fbaipublicfiles.com/video-pose-3d/data_2d_h36m_detectron_ft_h36m.npz 83 | cd .. 84 | ``` 85 | 86 | These detections have been produced by models fine-tuned on Human3.6M. We adopted the usual protocol of fine-tuning on 5 subjects (S1, S5, S6, S7, and S8). We also included detections from the unlabeled subjects S2, S3, S4, which can be loaded by our framework for semi-supervised experimentation. 87 | 88 | Optionally, you can download the Mask R-CNN detections without fine-tuning if you want to experiment with these: 89 | ```sh 90 | cd data 91 | wget https://dl.fbaipublicfiles.com/video-pose-3d/data_2d_h36m_detectron_pt_coco.npz 92 | cd .. 93 | ``` 94 | 95 | ### Stacked Hourglass detections 96 | These detections (both pretrained and fine-tuned) are provided by [Martinez et al.](https://github.com/una-dinosauria/3d-pose-baseline) in their repository on 3D human pose estimation. The 2D poses produced by the pretrained model are in the same archive as the dataset ([h36m.zip](https://www.dropbox.com/s/e35qv3n6zlkouki/h36m.zip)). The fine-tuned poses can be downloaded [here](https://drive.google.com/open?id=0BxWzojlLp259S2FuUXJ6aUNxZkE). Put the two archives in the `data` directory and run: 97 | 98 | ```sh 99 | cd data 100 | python prepare_data_2d_h36m_sh.py -pt h36m.zip 101 | python prepare_data_2d_h36m_sh.py -ft stacked_hourglass_fined_tuned_240.tar.gz 102 | cd .. 103 | ``` 104 | 105 | ## HumanEva-I 106 | For HumanEva, you need the original dataset and MATLAB. We provide a MATLAB script to extract the revelant parts of the dataset automatically. 107 | 108 | 1. Download the [HumanEva-I dataset](http://humaneva.is.tue.mpg.de/datasets_human_1) and extract it. 109 | 2. Download the official [source code v1.1 beta](http://humaneva.is.tue.mpg.de/main/download?file=Release_Code_v1_1_beta.zip) and extract it where you extracted the dataset. 110 | 3. Copy the contents of the directory `Release_Code_v1_1_beta\HumanEva_I` to the root of the source tree (`Release_Code_v1_1_beta/`). 111 | 4. Download the [critical dataset update](http://humaneva.is.tue.mpg.de/main/download?file=Critical_Update_OFS_files.zip) and apply it. 112 | 5. **Important:** for visualization purposes, the original code requires an old library named *dxAvi*, which is used for decoding XVID videos. A precompiled binary for 32-bit architectures is already included, but if you are running MATLAB on a 64-bit system, the code will not work. You can either recompile *dxAvi* library for x64, or bypass it entirely, since we are not using visualization features in our conversion script. 
To this end, you can patch `@sync_stream/sync_stream.m`, replacing line 202: `ImageStream(I) = image_stream(image_paths{I}, start_image_offset(I));` with `ImageStream(I) = 0;` 113 | 6. Now you can copy our script `ConvertHumanEva.m` (from `data/`) to `Release_Code_v1_1_beta/`, and run it. It will create a directory named `converted_15j`, which contains the converted 2D/3D ground-truth poses on a 15-joint skeleton. 114 | 7. **Optional:** if you want to experiment with a 20-joint skeleton, change `N_JOINTS` to 20 in `ConvertHumanEva.m`, and repeat the process. It will create a directory named `converted_20j`. Adapt next steps accordingly. 115 | 116 | If you get warnings about mocap errors or dropped frames, this is normal. The HumanEva dataset contains some invalid frames due to occlusions, which are simply discarded. Since we work with videos (and not individual frames), we try to minimize the impact of this issue by grouping valid sequences into contiguous chunks. 117 | 118 | Finally, run the Python script to produce the final files: 119 | ``` 120 | python prepare_data_humaneva.py -p /path/to/dataset/Release_Code_v1_1_beta/converted_15j --convert-3d 121 | ``` 122 | You should end up with two files in the `data` directory: `data_3d_humaneva15.npz` for the 3D poses, and `data_2d_humaneva15_gt.npz` for the ground-truth 2D poses. 123 | 124 | ### 2D detections for HumanEva-I 125 | We provide support for the following 2D detections: 126 | 127 | - `gt`: ground-truth 2D poses, extracted through camera projection. 128 | - `detectron_pt_coco`: Detectron (Mask R-CNN) detections, pretrained on COCO. 129 | 130 | Since HumanEva is very small, we do not fine-tune the pretrained models. As before, you can download Mask R-CNN detections from AWS (`data_2d_humaneva15_detectron_pt_coco.npz`, which must be copied to `data/`). As before, we have included detections for unlabeled subjects/actions. These begin with the prefix `Unlabeled/`. Chunks that correspond to corrupted motion capture streams are also marked as unlabeled. 131 | ```sh 132 | cd data 133 | wget https://dl.fbaipublicfiles.com/video-pose-3d/data_2d_humaneva15_detectron_pt_coco.npz 134 | cd .. 135 | ``` -------------------------------------------------------------------------------- /DOCUMENTATION.md: -------------------------------------------------------------------------------- 1 | # Documentation 2 | This guide explains in depth all the features of this framework. Make sure you have read the quick start guide in [`README.md`](README.md) before proceeding. 3 | 4 | ## Training 5 | By default, the script `run.py` runs in training mode. The list of command-line arguments is defined in `common/arguments.py`. 6 | 7 | - `-h`: shows the help / list of parameters. 8 | - `-d` or `--dataset`: specifies the dataset to use (`h36m` or `humaneva15`). Default: `h36m`. If you converted the 20-joint HumanEva skeleton, you can also use `humaneva20`. 9 | - `-k` or `--keypoints`: specifies the 2D detections to use. Default: `cpn_ft_h36m_dbb` (CPN fine-tuned on Human 3.6M). 10 | - `-c` or `--checkpoint`: specifies the directory where checkpoints are saved/read. Default: `checkpoint`. 11 | - `--checkpoint-frequency`: save checkpoints every N epochs. Default: `10`. 12 | - `-r` or `--resume`: resume training from a particular checkpoint (you should only specify the file name, not the path), e.g. `epoch_10.bin`. 13 | - `-str` or `--subjects-train`: specifies the list of subjects on which the model is trained, separated by commas. Default: `S1,S5,S6,S7,S8`. 
For HumanEva, you may want to specify these manually. 14 | - `-ste` or `--subjects-test`: specifies the list of subjects on which the model is tested at the end of each epoch (and in the final evaluation), separated by comma. Default: `S9,S11`. For HumanEva, you may want to specify these manually. 15 | - `-a` or `--actions`: select only a subset of actions, separated by commas. E.g. `Walk,Jog`. By default, all actions are used. 16 | - `-e` or `--epochs`: train for N epochs, i.e. N passes over the entire training set. Default: `60`. 17 | - `--no-eval`: disable testing at the end of each epoch (marginal speed up). By default, testing is enabled. 18 | - `--export-training-curves`: export training curves as PNG images after every epoch. They are saved in the checkpoint directory. Default: disabled. 19 | 20 | 21 | If `--no-eval` is not specified, the model is tested at the end of each epoch, although the reported metric is merely an approximation of the final result (for performance reasons). Once training is over, the model is automatically tested using the full procedure. This means that you can also specify the testing parameters when training. 22 | 23 | Here is a description of the model hyperparameters: 24 | - `-s` or `--stride`: the chunk size used for training, i.e. the number of frames that are predicted at once from each sequence. Increasing this value improves training speed at the expense of the error (due to correlated batch statistics). Default: `1` frame, which ensures maximum decorrelation. When this value is set to `1`, we also employ an optimized implementation of the model (see implementation details). 25 | - `-b` or `--batch-size`: the batch size used for training the model, in terms of *output frames* (regardless of the stride/chunk length). Default: `1024` frames. 26 | - `-drop` or `--dropout`: dropout probability. Default: `0.25`. 27 | - `-lr` or `--learning-rate`: initial learning rate. Default: `0.001`. 28 | - `-lrd` or `--lr-decay`: learning rate decay after every epoch (multiplicative coefficient). Default: `0.95`. 29 | - `-no-tta` or `--no-test-time-augmentation`: disable test-time augmentation (which is enabled by default), i.e. do not flip poses horizontally when testing the model. Only effective when combined with data augmentation, so if you disable this you should also disable train-time data augmentation. 30 | - `-no-da` or `--no-data-augmentation`: disable train-time data augmentation (which is enabled by default), i.e. do not flip poses horizontally to double the training data. 31 | - `-arc` or `--architecture`: filter widths (only odd numbers supported) separated by commas. This parameter also specifies the number of residual blocks, and determines the receptive field of the model. The first number refers to the input layer, and is followed by the filter widths of the residual blocks. For instance, `3,5,5` uses `3x1` convolutions in the first layer, followed by two residual blocks with `5x1` convolutions. Default: `3,3,3`. Some valid examples are: 32 | -- `-arc 3,3,3` (27 frames) 33 | -- `-arc 3,3,7` (63 frames) 34 | -- `-arc 3,3,3,3` (81 frames) 35 | -- `-arc 3,3,3,3,3` (243 frames) 36 | - `--causal`: use causal (i.e. asymmetric) convolutions instead of symmetric convolutions. Causal convolutions are suitable for real-time applications because they do not exploit future frames (they only look in the past), but symmetric convolutions result in a better error since they can consider both past and future data. See below for more details. Default: disabled. 
37 | - `-ch` or `--channels`: number of channels in convolutions. Default: `1024`. 38 | - `--dense`: use dense convolutions instead of dilated convolutions. This is only useful for benchmarks and ablation experiments. 39 | - `--disable-optimizations`: disable the optimized implementation when `--stride` == `1`. This is only useful for benchmarks. 40 | 41 | ## Semi-supervised training 42 | Semi-supervised learning is only implemented for Human3.6M. 43 | 44 | - `-sun` or `--subjects-unlabeled`: specifies the list of unlabeled subjects that are used for semi-supervision (separated by commas). Semi-supervised learning is automatically enabled when this parameter is set. 45 | - `--warmup`: number of supervised training epochs before attaching the semi-supervised loss. Default: `1` epoch. You may want to increase this when downsampling the dataset. 46 | - `--subset`: reduce the size of the training set by a given factor (a real number). E.g. `0.1` uses one tenth of the training data. Subsampling is achieved by extracting a random contiguous chunk from each video, while preserving the original frame rate. Default: `1` (i.e. disabled). This parameter can also be used in a supervised setting, but it is especially useful to simulate data scarcity in a semi-supervised setting. 47 | - `--downsample`: reduce the dataset frame rate by an integer factor. Default: `1` (i.e. disabled). 48 | - `--no-bone-length`: do not add the bone length term to the unsupervised loss function (only useful for ablation experiments). 49 | - `--linear-projection`: ignore non-linear camera distortion parameters when performing projection to 2D, i.e. use only focal length and principal point. 50 | - `--no-proj`: do not add the projection consistency term to the loss function (only useful for ablations). 51 | 52 | ## Testing 53 | To test a particular model, you need to specify the checkpoint file via the `--evaluate` parameter, which will be loaded from the checkpoint directory (default: `checkpoint/`, but you can change it using the `-c` parameter). You also need to specify the same settings/hyperparameters that you used for training (e.g. input keypoints, architecture, etc.). The script will not run any compatibility checks -- this is a design choice to facilitate ablation experiments. 54 | 55 | ## Visualization 56 | You can render videos by specifying both `--evaluate` and `--render`. The script generates a visualization which contains three viewports: the 2D input keypoints (and optionally, a video overlay), the 3D reconstruction, and the 3D ground truth. 57 | Note that when you specify a video, the 2D detections are still loaded from the dataset according to the given parameters. It is up to you to choose the correct video. You can also visualize unlabeled videos -- in this case, the ground truth will not be shown. 58 | 59 | Here is a list of the command-line arguments related to visualization: 60 | - `--viz-subject`: subject to render, e.g. `S1`. 61 | - `--viz-action`: action to render, e.g. `Walking` or `Walking 1`. 62 | - `--viz-camera`: camera to render (integer), from 0 to 3 for Human3.6M, 0 to 2 for HumanEva. Default: `0`. 63 | - `--viz-video`: path to the 2D video to show. If specified, the script will render a skeleton overlay on top of the video. If not specified, a black background will be rendered instead (but the 2D detections will still be shown). 64 | - `--viz-skip`: skip the first N frames from the specified video. Useful for HumanEva. Default: `0`. 
65 | - `--viz-output`: output file name (either a `.mp4` or `.gif` file). 66 | - `--viz-bitrate`: bitrate for MP4 videos. Default: `3000`. 67 | - `--viz-no-ground-truth`: by default, the videos contain three viewports: the 2D input pose, the 3D reconstruction, and the 3D ground truth. This flags removes the last one. 68 | - `--viz-limit`: render only first N frames. By default, all frames are rendered. 69 | - `--viz-downsample`: downsample videos by the specified factor, i.e. reduce the frame rate. E.g. if set to `2`, the frame rate is reduced from 50 FPS to 25 FPS. Default: `1` (no downsampling). 70 | - `--viz-size`: output resolution multiplier. Higher = larger images. Default: `5`. 71 | - `--viz-export`: export 3D joint coordinates (in camera space) to the specified NumPy archive. 72 | 73 | Example: 74 | ``` 75 | python run.py -k cpn_ft_h36m_dbb -arc 3,3,3,3,3 -c checkpoint --evaluate pretrained_h36m_cpn.bin --render --viz-subject S11 --viz-action Walking --viz-camera 0 --viz-video "/path/to/videos/S11/Videos/Walking.54138969.mp4" --viz-output output.gif --viz-size 3 --viz-downsample 2 --viz-limit 60 76 | ``` 77 | ![](images/demo_h36m.gif) 78 | 79 | Generates a visualization for S11/Walking from camera 0, and exports the first frames to a GIF animation with a frame rate of 25 FPS. If you remove the `--viz-video` parameter, the skeleton overlay will be rendered on a blank background. 80 | 81 | While Human3.6M visualization works out of the box, HumanEva visualization is trickier because the original videos must be segmented manually. Additionally, invalid frames and software synchronization complicate matters. Nonetheless, you can get decent visualizations by selecting the chunk 0 of validation sequences (which start at the beginning of each video) and discarding the first frames using `--viz-skip`. For a suggestion on the number of frames to skip, take a look at `sync_data` in `data/prepare_data_humaneva.py`. 82 | 83 | Example: 84 | ``` 85 | python run.py -d humaneva15 -k detectron_pt_coco -str Train/S1,Train/S2,Train/S3 -ste Validate/S1,Validate/S2,Validate/S3 -c checkpoint --evaluate pretrained_humaneva15_detectron.bin --render --viz-subject Validate/S2 --viz-action "Walking 1 chunk0" --viz-camera 0 --viz-output output_he.gif --viz-size 3 --viz-downsample 2 --viz-video "/path/to/videos/S2/Walking_1_(C1).avi" --viz-skip 115 --viz-limit 60 86 | ``` 87 | ![](images/demo_humaneva.gif) 88 | 89 | Unlabeled videos are easier to visualize because they do not require synchronization with the ground truth. In this case, visualization works out of the box even for HumanEva. 90 | 91 | Example: 92 | ``` 93 | python run.py -d humaneva15 -k detectron_pt_coco -str Train/S1,Train/S2,Train/S3 -ste Validate/S1,Validate/S2,Validate/S3 -c checkpoint --evaluate pretrained_humaneva15_detectron.bin --render --viz-subject Unlabeled/S4 --viz-action "Box 2" --viz-camera 0 --viz-output output_he.gif --viz-size 3 --viz-downsample 2 --viz-video "/path/to/videos/S4/Box_2_(C1).avi" --viz-limit 60 94 | ``` 95 | ![](images/demo_humaneva_unlabeled.gif) 96 | 97 | ## Implementation details 98 | ### Batch generation during training 99 | Some details of our training procedure are better understood visually. 100 | ![](images/batching.png) 101 | The figure above shows how training batches are generated, depending on the value of `--stride` (from left to right: 1, 2, and 4). This example shows a sequence of 2D poses which has a length of N = 8 frames. 
The 3D poses (blue boxes in the figure) are inferred using a model that has a receptive field F = 5 frames. Therefore, because of valid padding, an input sequence of length N results in an output sequence of length N - F + 1, i.e. N - 4 in this example. 102 | 103 | When `--stride=1`, we generate one training example for each frame. This ensures that the batches are maximally uncorrelated, which helps batch normalization as well as generalization. As `--stride` increases, training becomes faster because the model can reutilize intermediate computations, at the cost of biased batch statistics. However, we provide an optimized implementation when `--stride=1`, which replaces dilated convolutions with strided convolutions (only while training), so in principle you should not touch this parameter unless you want to run specific experiments. To understand how it works, see the figures below: 104 | 105 | ![](images/convolutions_1f_naive.png) 106 | The figure above shows the information flow for a model with a receptive field of 27 frames, and a single-frame prediction, i.e. from N = 27 input frames we end up with one output frame. You can observe that this regular implementation tends to waste some intermediate results when a small number of frames are predicted. However, for inference of very long sequences, this approach is very efficient as intermediate results are shared among successive frames. 107 | 108 | ![](images/convolutions_1f_optimized.png) 109 | Therefore, for training *only*, we use the implementation above, which replaces dilated convolutions with strided convolutions. It achieves the same result, but avoids computing unnecessary intermediate results. 110 | 111 | ### Symmetric convolutions vs causal convolutions 112 | The figures below show the information flow from input (bottom) to output (top). In this example, we adopt a model with a receptive field of 27 frames. 113 | 114 | ![](images/convolutions_normal.png) 115 | With symmetric convolutions, both past and future information is exploited, resulting in a better reconstruction. 116 | 117 | ![](images/convolutions_causal.png) 118 | With causal convolutions, only past data is exploited. This approach is suited to real-time applications where future data cannot be exploited, at the cost of a slightly higher error. -------------------------------------------------------------------------------- /INFERENCE.md: -------------------------------------------------------------------------------- 1 | # Inference in the wild 2 | 3 | **Update:** we have added support for Detectron2. 4 | 5 | In this short tutorial, we show how to run our model on arbitrary videos and visualize the predictions. Note that this feature is only provided for experimentation/research purposes and presents some limitations, as this repository is meant to provide a reference implementation of the approach described in the paper (not production-ready code for inference in the wild). 6 | 7 | Our script assumes that a video depicts *exactly* one person. In case of multiple people visible at once, the script will select the person corresponding to the bounding box with the highest confidence, which may cause glitches. 8 | 9 | The instructions below show how to use Detectron to infer 2D keypoints from videos, convert them to a custom dataset for our code, and infer 3D poses. For now, we do not have instructions for CPN. In the last section of this tutorial, we also provide some tips. 
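For intuition, here is a minimal sketch of the highest-confidence selection mentioned above (the function name and the bounding-box layout, with the confidence score in the last column, are assumptions made for illustration; the actual per-frame logic used when building the custom dataset is part of `data/prepare_data_2d_custom.py`, see Step 4).

```python
import numpy as np

def select_best_person(boxes, keypoints):
    # boxes:     (num_people, 5) array -- x1, y1, x2, y2, score (assumed layout)
    # keypoints: (num_people, num_joints, 2) array of 2D joint coordinates
    if len(boxes) == 0:
        # Missed detection: the keypoints for this frame are later
        # interpolated from neighboring frames (see Step 4).
        return None
    best = int(np.argmax(boxes[:, 4]))  # index of the highest-confidence box
    return keypoints[best]
```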
10 | 11 | ## Step 1: setup 12 | The inference script requires `ffmpeg`, which you can easily install via conda, pip, or manually. 13 | 14 | Download the [pretrained model](https://dl.fbaipublicfiles.com/video-pose-3d/pretrained_h36m_detectron_coco.bin) for generating 3D predictions. This model is different than the pretrained ones listed in the main README, as it expects input keypoints in COCO format (generated by the pretrained Detectron model) and outputs 3D joint positions in Human3.6M format. Put this model in the `checkpoint` directory of this repo. 15 | 16 | **Note:** if you had downloaded `d-pt-243.bin`, you should download the new pretrained model using the link above. `d-pt-243.bin` takes the keypoint probabilities as input (in addition to the x, y coordinates), which causes problems on videos with a different resolution than that of Human3.6M. The new model is only trained on 2D coordinates and works with any resolution/aspect ratio. 17 | 18 | ## Step 2 (optional): video preprocessing 19 | Since the script expects a single-person scenario, you may want to extract a portion of your video. This is very easy to do with ffmpeg, e.g. 20 | ``` 21 | ffmpeg -i input.mp4 -ss 1:00 -to 1:30 -c copy output.mp4 22 | ``` 23 | extracts a clip from minute 1:00 to minute 1:30 of `input.mp4`, and exports it to `output.mp4`. 24 | 25 | Optionally, you can also adapt the frame rate of the video. Most videos have a frame rate of about 25 FPS, but our Human3.6M model was trained on 50-FPS videos. Since our model is robust to alterations in speed, this step is not very important and can be skipped, but if you want the best possible results you can use ffmpeg again for this task: 26 | ``` 27 | ffmpeg -i input.mp4 -filter "minterpolate='fps=50'" -crf 0 output.mp4 28 | ``` 29 | 30 | ## Step 3: inferring 2D keypoints with Detectron 31 | 32 | ### Using Detectron2 (new) 33 | Set up [Detectron2](https://github.com/facebookresearch/detectron2) and use the script `inference/infer_video_d2.py` (no need to copy this, as it directly uses the Detectron2 API). This script provides a convenient interface to generate 2D keypoint predictions from videos without manually extracting individual frames. 34 | 35 | To infer keypoints from all the mp4 videos in `input_directory`, run 36 | ``` 37 | cd inference 38 | python infer_video_d2.py \ 39 | --cfg COCO-Keypoints/keypoint_rcnn_R_101_FPN_3x.yaml \ 40 | --output-dir output_directory \ 41 | --image-ext mp4 \ 42 | input_directory 43 | ``` 44 | The results will be exported to `output_directory` as custom NumPy archives (`.npz` files). You can change the video extension in `--image-ext` (ffmpeg supports a wide range of formats). 45 | 46 | **Note:** although the architecture is the same (ResNet-101), the weights used by the Detectron2 model are not the same as those used by Detectron1. Since our pretrained model was trained on Detectron1 poses, the result might be slightly different (but it should still be pretty close). 47 | 48 | ### Using Detectron1 (old instructions) 49 | Set up [Detectron](https://github.com/facebookresearch/Detectron) and copy the script `inference/infer_video.py` from this repo to the `tools` directory of the Detectron repo. This script provides a convenient interface to generate 2D keypoint predictions from videos without manually extracting individual frames. 50 | 51 | Our Detectron script `infer_video.py` is a simple adaptation of `infer_simple.py` (which works on images) and has a similar command-line syntax. 
52 | 53 | To infer keypoints from all the mp4 videos in `input_directory`, run 54 | ``` 55 | python tools/infer_video.py \ 56 | --cfg configs/12_2017_baselines/e2e_keypoint_rcnn_R-101-FPN_s1x.yaml \ 57 | --output-dir output_directory \ 58 | --image-ext mp4 \ 59 | --wts https://dl.fbaipublicfiles.com/detectron/37698009/12_2017_baselines/e2e_keypoint_rcnn_R-101-FPN_s1x.yaml.08_45_57.YkrJgP6O/output/train/keypoints_coco_2014_train:keypoints_coco_2014_valminusminival/generalized_rcnn/model_final.pkl \ 60 | input_directory 61 | ``` 62 | The results will be exported to `output_directory` as custom NumPy archives (`.npz` files). You can change the video extension in `--image-ext` (ffmpeg supports a wide range of formats). 63 | 64 | ## Step 4: creating a custom dataset 65 | Run our dataset preprocessing script from the `data` directory: 66 | ``` 67 | python prepare_data_2d_custom.py -i /path/to/detections/output_directory -o myvideos 68 | ``` 69 | This creates a custom dataset named `myvideos` (which contains all the videos in `output_directory`, each of which is mapped to a different subject) and saved to `data_2d_custom_myvideos.npz`. You are free to specify any name for the dataset. 70 | 71 | **Note:** as mentioned, the script will take the bounding box with the highest probability for each frame. If a particular frame has no bounding boxes, it is assumed to be a missed detection and the keypoints will be interpolated from neighboring frames. 72 | 73 | ## Step 5: rendering a custom video and exporting coordinates 74 | You can finally use the visualization feature to render a video of the 3D joint predictions. You must specify the `custom` dataset (`-d custom`), the input keypoints as exported in the previous step (`-k myvideos`), the correct architecture/checkpoint, and the action `custom` (`--viz-action custom`). The subject is the file name of the input video, and the camera is always 0. 75 | ``` 76 | python run.py -d custom -k myvideos -arc 3,3,3,3,3 -c checkpoint --evaluate pretrained_h36m_detectron_coco.bin --render --viz-subject input_video.mp4 --viz-action custom --viz-camera 0 --viz-video /path/to/input_video.mp4 --viz-output output.mp4 --viz-size 6 77 | ``` 78 | 79 | You can also export the 3D joint positions (in camera space) to a NumPy archive. To this end, replace `--viz-output` with `--viz-export` and specify the file name. 80 | 81 | ## Limitations and tips 82 | - The model was trained on Human3.6M cameras (which are relatively undistorted), and the results may be bad if the intrinsic parameters of the cameras of your videos differ much from those of Human3.6M. This may be particularly noticeable with fisheye cameras, which present a high degree of non-linear lens distortion. If the camera parameters are known, consider preprocessing your videos to match those of Human3.6M as closely as possible. 83 | - If you want multi-person tracking, you should implement a bounding box matching strategy. An example would be to use bipartite matching on the bounding box overlap (IoU) between subsequent frames, but there are many other approaches. 84 | - Predictions are relative to the root joint, i.e. the global trajectory is not regressed. If you need it, you may want to use another model to regress it, such as the one we use for semi-supervision. 85 | - Predictions are always in *camera space* (regardless of whether the trajectory is available). 
For our visualization script, we simply take a random camera from Human3.6M, which fits decently most videos where the camera viewport is parallel to the ground. -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Attribution-NonCommercial 4.0 International 2 | 3 | ======================================================================= 4 | 5 | Creative Commons Corporation ("Creative Commons") is not a law firm and 6 | does not provide legal services or legal advice. Distribution of 7 | Creative Commons public licenses does not create a lawyer-client or 8 | other relationship. Creative Commons makes its licenses and related 9 | information available on an "as-is" basis. Creative Commons gives no 10 | warranties regarding its licenses, any material licensed under their 11 | terms and conditions, or any related information. Creative Commons 12 | disclaims all liability for damages resulting from their use to the 13 | fullest extent possible. 14 | 15 | Using Creative Commons Public Licenses 16 | 17 | Creative Commons public licenses provide a standard set of terms and 18 | conditions that creators and other rights holders may use to share 19 | original works of authorship and other material subject to copyright 20 | and certain other rights specified in the public license below. The 21 | following considerations are for informational purposes only, are not 22 | exhaustive, and do not form part of our licenses. 23 | 24 | Considerations for licensors: Our public licenses are 25 | intended for use by those authorized to give the public 26 | permission to use material in ways otherwise restricted by 27 | copyright and certain other rights. Our licenses are 28 | irrevocable. Licensors should read and understand the terms 29 | and conditions of the license they choose before applying it. 30 | Licensors should also secure all rights necessary before 31 | applying our licenses so that the public can reuse the 32 | material as expected. Licensors should clearly mark any 33 | material not subject to the license. This includes other CC- 34 | licensed material, or material used under an exception or 35 | limitation to copyright. More considerations for licensors: 36 | wiki.creativecommons.org/Considerations_for_licensors 37 | 38 | Considerations for the public: By using one of our public 39 | licenses, a licensor grants the public permission to use the 40 | licensed material under specified terms and conditions. If 41 | the licensor's permission is not necessary for any reason--for 42 | example, because of any applicable exception or limitation to 43 | copyright--then that use is not regulated by the license. Our 44 | licenses grant only permissions under copyright and certain 45 | other rights that a licensor has authority to grant. Use of 46 | the licensed material may still be restricted for other 47 | reasons, including because others have copyright or other 48 | rights in the material. A licensor may make special requests, 49 | such as asking that all changes be marked or described. 50 | Although not required by our licenses, you are encouraged to 51 | respect those requests where reasonable. 
More_considerations 52 | for the public: 53 | wiki.creativecommons.org/Considerations_for_licensees 54 | 55 | ======================================================================= 56 | 57 | Creative Commons Attribution-NonCommercial 4.0 International Public 58 | License 59 | 60 | By exercising the Licensed Rights (defined below), You accept and agree 61 | to be bound by the terms and conditions of this Creative Commons 62 | Attribution-NonCommercial 4.0 International Public License ("Public 63 | License"). To the extent this Public License may be interpreted as a 64 | contract, You are granted the Licensed Rights in consideration of Your 65 | acceptance of these terms and conditions, and the Licensor grants You 66 | such rights in consideration of benefits the Licensor receives from 67 | making the Licensed Material available under these terms and 68 | conditions. 69 | 70 | Section 1 -- Definitions. 71 | 72 | a. Adapted Material means material subject to Copyright and Similar 73 | Rights that is derived from or based upon the Licensed Material 74 | and in which the Licensed Material is translated, altered, 75 | arranged, transformed, or otherwise modified in a manner requiring 76 | permission under the Copyright and Similar Rights held by the 77 | Licensor. For purposes of this Public License, where the Licensed 78 | Material is a musical work, performance, or sound recording, 79 | Adapted Material is always produced where the Licensed Material is 80 | synched in timed relation with a moving image. 81 | 82 | b. Adapter's License means the license You apply to Your Copyright 83 | and Similar Rights in Your contributions to Adapted Material in 84 | accordance with the terms and conditions of this Public License. 85 | 86 | c. Copyright and Similar Rights means copyright and/or similar rights 87 | closely related to copyright including, without limitation, 88 | performance, broadcast, sound recording, and Sui Generis Database 89 | Rights, without regard to how the rights are labeled or 90 | categorized. For purposes of this Public License, the rights 91 | specified in Section 2(b)(1)-(2) are not Copyright and Similar 92 | Rights. 93 | d. Effective Technological Measures means those measures that, in the 94 | absence of proper authority, may not be circumvented under laws 95 | fulfilling obligations under Article 11 of the WIPO Copyright 96 | Treaty adopted on December 20, 1996, and/or similar international 97 | agreements. 98 | 99 | e. Exceptions and Limitations means fair use, fair dealing, and/or 100 | any other exception or limitation to Copyright and Similar Rights 101 | that applies to Your use of the Licensed Material. 102 | 103 | f. Licensed Material means the artistic or literary work, database, 104 | or other material to which the Licensor applied this Public 105 | License. 106 | 107 | g. Licensed Rights means the rights granted to You subject to the 108 | terms and conditions of this Public License, which are limited to 109 | all Copyright and Similar Rights that apply to Your use of the 110 | Licensed Material and that the Licensor has authority to license. 111 | 112 | h. Licensor means the individual(s) or entity(ies) granting rights 113 | under this Public License. 114 | 115 | i. NonCommercial means not primarily intended for or directed towards 116 | commercial advantage or monetary compensation. 
For purposes of 117 | this Public License, the exchange of the Licensed Material for 118 | other material subject to Copyright and Similar Rights by digital 119 | file-sharing or similar means is NonCommercial provided there is 120 | no payment of monetary compensation in connection with the 121 | exchange. 122 | 123 | j. Share means to provide material to the public by any means or 124 | process that requires permission under the Licensed Rights, such 125 | as reproduction, public display, public performance, distribution, 126 | dissemination, communication, or importation, and to make material 127 | available to the public including in ways that members of the 128 | public may access the material from a place and at a time 129 | individually chosen by them. 130 | 131 | k. Sui Generis Database Rights means rights other than copyright 132 | resulting from Directive 96/9/EC of the European Parliament and of 133 | the Council of 11 March 1996 on the legal protection of databases, 134 | as amended and/or succeeded, as well as other essentially 135 | equivalent rights anywhere in the world. 136 | 137 | l. You means the individual or entity exercising the Licensed Rights 138 | under this Public License. Your has a corresponding meaning. 139 | 140 | Section 2 -- Scope. 141 | 142 | a. License grant. 143 | 144 | 1. Subject to the terms and conditions of this Public License, 145 | the Licensor hereby grants You a worldwide, royalty-free, 146 | non-sublicensable, non-exclusive, irrevocable license to 147 | exercise the Licensed Rights in the Licensed Material to: 148 | 149 | a. reproduce and Share the Licensed Material, in whole or 150 | in part, for NonCommercial purposes only; and 151 | 152 | b. produce, reproduce, and Share Adapted Material for 153 | NonCommercial purposes only. 154 | 155 | 2. Exceptions and Limitations. For the avoidance of doubt, where 156 | Exceptions and Limitations apply to Your use, this Public 157 | License does not apply, and You do not need to comply with 158 | its terms and conditions. 159 | 160 | 3. Term. The term of this Public License is specified in Section 161 | 6(a). 162 | 163 | 4. Media and formats; technical modifications allowed. The 164 | Licensor authorizes You to exercise the Licensed Rights in 165 | all media and formats whether now known or hereafter created, 166 | and to make technical modifications necessary to do so. The 167 | Licensor waives and/or agrees not to assert any right or 168 | authority to forbid You from making technical modifications 169 | necessary to exercise the Licensed Rights, including 170 | technical modifications necessary to circumvent Effective 171 | Technological Measures. For purposes of this Public License, 172 | simply making modifications authorized by this Section 2(a) 173 | (4) never produces Adapted Material. 174 | 175 | 5. Downstream recipients. 176 | 177 | a. Offer from the Licensor -- Licensed Material. Every 178 | recipient of the Licensed Material automatically 179 | receives an offer from the Licensor to exercise the 180 | Licensed Rights under the terms and conditions of this 181 | Public License. 182 | 183 | b. No downstream restrictions. You may not offer or impose 184 | any additional or different terms or conditions on, or 185 | apply any Effective Technological Measures to, the 186 | Licensed Material if doing so restricts exercise of the 187 | Licensed Rights by any recipient of the Licensed 188 | Material. 189 | 190 | 6. No endorsement. 
Nothing in this Public License constitutes or 191 | may be construed as permission to assert or imply that You 192 | are, or that Your use of the Licensed Material is, connected 193 | with, or sponsored, endorsed, or granted official status by, 194 | the Licensor or others designated to receive attribution as 195 | provided in Section 3(a)(1)(A)(i). 196 | 197 | b. Other rights. 198 | 199 | 1. Moral rights, such as the right of integrity, are not 200 | licensed under this Public License, nor are publicity, 201 | privacy, and/or other similar personality rights; however, to 202 | the extent possible, the Licensor waives and/or agrees not to 203 | assert any such rights held by the Licensor to the limited 204 | extent necessary to allow You to exercise the Licensed 205 | Rights, but not otherwise. 206 | 207 | 2. Patent and trademark rights are not licensed under this 208 | Public License. 209 | 210 | 3. To the extent possible, the Licensor waives any right to 211 | collect royalties from You for the exercise of the Licensed 212 | Rights, whether directly or through a collecting society 213 | under any voluntary or waivable statutory or compulsory 214 | licensing scheme. In all other cases the Licensor expressly 215 | reserves any right to collect such royalties, including when 216 | the Licensed Material is used other than for NonCommercial 217 | purposes. 218 | 219 | Section 3 -- License Conditions. 220 | 221 | Your exercise of the Licensed Rights is expressly made subject to the 222 | following conditions. 223 | 224 | a. Attribution. 225 | 226 | 1. If You Share the Licensed Material (including in modified 227 | form), You must: 228 | 229 | a. retain the following if it is supplied by the Licensor 230 | with the Licensed Material: 231 | 232 | i. identification of the creator(s) of the Licensed 233 | Material and any others designated to receive 234 | attribution, in any reasonable manner requested by 235 | the Licensor (including by pseudonym if 236 | designated); 237 | 238 | ii. a copyright notice; 239 | 240 | iii. a notice that refers to this Public License; 241 | 242 | iv. a notice that refers to the disclaimer of 243 | warranties; 244 | 245 | v. a URI or hyperlink to the Licensed Material to the 246 | extent reasonably practicable; 247 | 248 | b. indicate if You modified the Licensed Material and 249 | retain an indication of any previous modifications; and 250 | 251 | c. indicate the Licensed Material is licensed under this 252 | Public License, and include the text of, or the URI or 253 | hyperlink to, this Public License. 254 | 255 | 2. You may satisfy the conditions in Section 3(a)(1) in any 256 | reasonable manner based on the medium, means, and context in 257 | which You Share the Licensed Material. For example, it may be 258 | reasonable to satisfy the conditions by providing a URI or 259 | hyperlink to a resource that includes the required 260 | information. 261 | 262 | 3. If requested by the Licensor, You must remove any of the 263 | information required by Section 3(a)(1)(A) to the extent 264 | reasonably practicable. 265 | 266 | 4. If You Share Adapted Material You produce, the Adapter's 267 | License You apply must not prevent recipients of the Adapted 268 | Material from complying with this Public License. 269 | 270 | Section 4 -- Sui Generis Database Rights. 271 | 272 | Where the Licensed Rights include Sui Generis Database Rights that 273 | apply to Your use of the Licensed Material: 274 | 275 | a. 
for the avoidance of doubt, Section 2(a)(1) grants You the right 276 | to extract, reuse, reproduce, and Share all or a substantial 277 | portion of the contents of the database for NonCommercial purposes 278 | only; 279 | 280 | b. if You include all or a substantial portion of the database 281 | contents in a database in which You have Sui Generis Database 282 | Rights, then the database in which You have Sui Generis Database 283 | Rights (but not its individual contents) is Adapted Material; and 284 | 285 | c. You must comply with the conditions in Section 3(a) if You Share 286 | all or a substantial portion of the contents of the database. 287 | 288 | For the avoidance of doubt, this Section 4 supplements and does not 289 | replace Your obligations under this Public License where the Licensed 290 | Rights include other Copyright and Similar Rights. 291 | 292 | Section 5 -- Disclaimer of Warranties and Limitation of Liability. 293 | 294 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE 295 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS 296 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF 297 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, 298 | IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION, 299 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR 300 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, 301 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT 302 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT 303 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. 304 | 305 | b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE 306 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, 307 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, 308 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, 309 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR 310 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN 311 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR 312 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR 313 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. 314 | 315 | c. The disclaimer of warranties and limitation of liability provided 316 | above shall be interpreted in a manner that, to the extent 317 | possible, most closely approximates an absolute disclaimer and 318 | waiver of all liability. 319 | 320 | Section 6 -- Term and Termination. 321 | 322 | a. This Public License applies for the term of the Copyright and 323 | Similar Rights licensed here. However, if You fail to comply with 324 | this Public License, then Your rights under this Public License 325 | terminate automatically. 326 | 327 | b. Where Your right to use the Licensed Material has terminated under 328 | Section 6(a), it reinstates: 329 | 330 | 1. automatically as of the date the violation is cured, provided 331 | it is cured within 30 days of Your discovery of the 332 | violation; or 333 | 334 | 2. upon express reinstatement by the Licensor. 335 | 336 | For the avoidance of doubt, this Section 6(b) does not affect any 337 | right the Licensor may have to seek remedies for Your violations 338 | of this Public License. 339 | 340 | c. 
For the avoidance of doubt, the Licensor may also offer the 341 | Licensed Material under separate terms or conditions or stop 342 | distributing the Licensed Material at any time; however, doing so 343 | will not terminate this Public License. 344 | 345 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public 346 | License. 347 | 348 | Section 7 -- Other Terms and Conditions. 349 | 350 | a. The Licensor shall not be bound by any additional or different 351 | terms or conditions communicated by You unless expressly agreed. 352 | 353 | b. Any arrangements, understandings, or agreements regarding the 354 | Licensed Material not stated herein are separate from and 355 | independent of the terms and conditions of this Public License. 356 | 357 | Section 8 -- Interpretation. 358 | 359 | a. For the avoidance of doubt, this Public License does not, and 360 | shall not be interpreted to, reduce, limit, restrict, or impose 361 | conditions on any use of the Licensed Material that could lawfully 362 | be made without permission under this Public License. 363 | 364 | b. To the extent possible, if any provision of this Public License is 365 | deemed unenforceable, it shall be automatically reformed to the 366 | minimum extent necessary to make it enforceable. If the provision 367 | cannot be reformed, it shall be severed from this Public License 368 | without affecting the enforceability of the remaining terms and 369 | conditions. 370 | 371 | c. No term or condition of this Public License will be waived and no 372 | failure to comply consented to unless expressly agreed to by the 373 | Licensor. 374 | 375 | d. Nothing in this Public License constitutes or may be interpreted 376 | as a limitation upon, or waiver of, any privileges and immunities 377 | that apply to the Licensor or You, including from the legal 378 | processes of any jurisdiction or authority. 379 | 380 | ======================================================================= 381 | 382 | Creative Commons is not a party to its public 383 | licenses. Notwithstanding, Creative Commons may elect to apply one of 384 | its public licenses to material it publishes and in those instances 385 | will be considered the “Licensor.” The text of the Creative Commons 386 | public licenses is dedicated to the public domain under the CC0 Public 387 | Domain Dedication. Except for the limited purpose of indicating that 388 | material is shared under a Creative Commons public license or as 389 | otherwise permitted by the Creative Commons policies published at 390 | creativecommons.org/policies, Creative Commons does not authorize the 391 | use of the trademark "Creative Commons" or any other trademark or logo 392 | of Creative Commons without its prior written consent including, 393 | without limitation, in connection with any unauthorized modifications 394 | to any of its public licenses or any other arrangements, 395 | understandings, or agreements concerning use of licensed material. For 396 | the avoidance of doubt, this paragraph does not form part of the 397 | public licenses. 398 | 399 | Creative Commons may be contacted at creativecommons.org. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 3D human pose estimation in video with temporal convolutions and semi-supervised training 2 |

3 | 4 | This is the implementation of the approach described in the paper: 5 | > Dario Pavllo, Christoph Feichtenhofer, David Grangier, and Michael Auli. [3D human pose estimation in video with temporal convolutions and semi-supervised training](https://arxiv.org/abs/1811.11742). In Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 6 | 7 | More demos are available at https://dariopavllo.github.io/VideoPose3D 8 | 9 |

10 | 11 | ![](images/demo_temporal.gif) 12 | 13 | ### Results on Human3.6M 14 | Under Protocol 1 (mean per-joint position error) and Protocol 2 (mean-per-joint position error after rigid alignment). 15 | 16 | | 2D Detections | BBoxes | Blocks | Receptive Field | Error (P1) | Error (P2) | 17 | |:-------|:-------:|:-------:|:-------:|:-------:|:-------:| 18 | | CPN | Mask R-CNN | 4 | 243 frames | **46.8 mm** | **36.5 mm** | 19 | | CPN | Ground truth | 4 | 243 frames | 47.1 mm | 36.8 mm | 20 | | CPN | Ground truth | 3 | 81 frames | 47.7 mm | 37.2 mm | 21 | | CPN | Ground truth | 2 | 27 frames | 48.8 mm | 38.0 mm | 22 | | Mask R-CNN | Mask R-CNN | 4 | 243 frames | 51.6 mm | 40.3 mm | 23 | | Ground truth | -- | 4 | 243 frames | 37.2 mm | 27.2 mm | 24 | 25 | ## Quick start 26 | To get started as quickly as possible, follow the instructions in this section. This should allow you train a model from scratch, test our pretrained models, and produce basic visualizations. For more detailed instructions, please refer to [`DOCUMENTATION.md`](DOCUMENTATION.md). 27 | 28 | ### Dependencies 29 | Make sure you have the following dependencies installed before proceeding: 30 | - Python 3+ distribution 31 | - PyTorch >= 0.4.0 32 | 33 | Optional: 34 | - Matplotlib, if you want to visualize predictions. Additionally, you need *ffmpeg* to export MP4 videos, and *imagemagick* to export GIFs. 35 | - MATLAB, if you want to experiment with HumanEva-I (you need this to convert the dataset). 36 | 37 | ### Dataset setup 38 | You can find the instructions for setting up the Human3.6M and HumanEva-I datasets in [`DATASETS.md`](DATASETS.md). For this short guide, we focus on Human3.6M. You are not required to setup HumanEva, unless you want to experiment with it. 39 | 40 | In order to proceed, you must also copy CPN detections (for Human3.6M) and/or Mask R-CNN detections (for HumanEva). 41 | 42 | ### Evaluating our pretrained models 43 | The pretrained models can be downloaded from AWS. Put `pretrained_h36m_cpn.bin` (for Human3.6M) and/or `pretrained_humaneva15_detectron.bin` (for HumanEva) in the `checkpoint/` directory (create it if it does not exist). 44 | ```sh 45 | mkdir checkpoint 46 | cd checkpoint 47 | wget https://dl.fbaipublicfiles.com/video-pose-3d/pretrained_h36m_cpn.bin 48 | wget https://dl.fbaipublicfiles.com/video-pose-3d/pretrained_humaneva15_detectron.bin 49 | cd .. 50 | ``` 51 | 52 | These models allow you to reproduce our top-performing baselines, which are: 53 | - 46.8 mm for Human3.6M, using fine-tuned CPN detections, bounding boxes from Mask R-CNN, and an architecture with a receptive field of 243 frames. 54 | - 33.0 mm for HumanEva-I (on 3 actions), using pretrained Mask R-CNN detections, and an architecture with a receptive field of 27 frames. This is the multi-action model trained on 3 actions (Walk, Jog, Box). 55 | 56 | To test on Human3.6M, run: 57 | ``` 58 | python run.py -k cpn_ft_h36m_dbb -arc 3,3,3,3,3 -c checkpoint --evaluate pretrained_h36m_cpn.bin 59 | ``` 60 | 61 | To test on HumanEva, run: 62 | ``` 63 | python run.py -d humaneva15 -k detectron_pt_coco -str Train/S1,Train/S2,Train/S3 -ste Validate/S1,Validate/S2,Validate/S3 -a Walk,Jog,Box --by-subject -c checkpoint --evaluate pretrained_humaneva15_detectron.bin 64 | ``` 65 | 66 | [`DOCUMENTATION.md`](DOCUMENTATION.md) provides a precise description of all command-line arguments. 67 | 68 | ### Inference in the wild 69 | We have introduced an experimental feature to run our model on custom videos. 
See [`INFERENCE.md`](INFERENCE.md) for more details. 70 | 71 | ### Training from scratch 72 | If you want to reproduce the results of our pretrained models, run the following commands. 73 | 74 | For Human3.6M: 75 | ``` 76 | python run.py -e 80 -k cpn_ft_h36m_dbb -arc 3,3,3,3,3 77 | ``` 78 | By default the application runs in training mode. This will train a new model for 80 epochs, using fine-tuned CPN detections. Expect a training time of 24 hours on a high-end Pascal GPU. If you feel that this is too much, or your GPU is not powerful enough, you can train a model with a smaller receptive field, e.g. 79 | - `-arc 3,3,3,3` (81 frames) should require 11 hours and achieve 47.7 mm. 80 | - `-arc 3,3,3` (27 frames) should require 6 hours and achieve 48.8 mm. 81 | 82 | You could also lower the number of epochs from 80 to 60 with a negligible impact on the result. 83 | 84 | For HumanEva: 85 | ``` 86 | python run.py -d humaneva15 -k detectron_pt_coco -str Train/S1,Train/S2,Train/S3 -ste Validate/S1,Validate/S2,Validate/S3 -b 128 -e 1000 -lrd 0.996 -a Walk,Jog,Box --by-subject 87 | ``` 88 | This will train for 1000 epochs, using Mask R-CNN detections and evaluating each subject separately. 89 | Since HumanEva is much smaller than Human3.6M, training should require about 50 minutes. 90 | 91 | ### Semi-supervised training 92 | To perform semi-supervised training, you just need to add the `--subjects-unlabeled` argument. In the example below, we use ground-truth 2D poses as input, and train supervised on just 10% of Subject 1 (specified by `--subset 0.1`). The remaining subjects are treated as unlabeled data and are used for semi-supervision. 93 | ``` 94 | python run.py -k gt --subjects-train S1 --subset 0.1 --subjects-unlabeled S5,S6,S7,S8 -e 200 -lrd 0.98 -arc 3,3,3 --warmup 5 -b 64 95 | ``` 96 | This should give you an error around 65.2 mm. By contrast, if we only train supervised 97 | ``` 98 | python run.py -k gt --subjects-train S1 --subset 0.1 -e 200 -lrd 0.98 -arc 3,3,3 -b 64 99 | ``` 100 | we get around 80.7 mm, which is significantly higher. 101 | 102 | ### Visualization 103 | If you have the original Human3.6M videos, you can generate nice visualizations of the model predictions. For instance: 104 | ``` 105 | python run.py -k cpn_ft_h36m_dbb -arc 3,3,3,3,3 -c checkpoint --evaluate pretrained_h36m_cpn.bin --render --viz-subject S11 --viz-action Walking --viz-camera 0 --viz-video "/path/to/videos/S11/Videos/Walking.54138969.mp4" --viz-output output.gif --viz-size 3 --viz-downsample 2 --viz-limit 60 106 | ``` 107 | The script can also export MP4 videos, and supports a variety of parameters (e.g. downsampling/FPS, size, bitrate). See [`DOCUMENTATION.md`](DOCUMENTATION.md) for more details. 108 | 109 | ## License 110 | This work is licensed under CC BY-NC. See LICENSE for details. Third-party datasets are subject to their respective licenses. 
111 | If you use our code/models in your research, please cite our paper: 112 | ``` 113 | @inproceedings{pavllo:videopose3d:2019, 114 | title={3D human pose estimation in video with temporal convolutions and semi-supervised training}, 115 | author={Pavllo, Dario and Feichtenhofer, Christoph and Grangier, David and Auli, Michael}, 116 | booktitle={Conference on Computer Vision and Pattern Recognition (CVPR)}, 117 | year={2019} 118 | } 119 | ``` 120 | -------------------------------------------------------------------------------- /common/arguments.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | import argparse 9 | 10 | def parse_args(): 11 | parser = argparse.ArgumentParser(description='Training script') 12 | 13 | # General arguments 14 | parser.add_argument('-d', '--dataset', default='h36m', type=str, metavar='NAME', help='target dataset') # h36m or humaneva 15 | parser.add_argument('-k', '--keypoints', default='cpn_ft_h36m_dbb', type=str, metavar='NAME', help='2D detections to use') 16 | parser.add_argument('-str', '--subjects-train', default='S1,S5,S6,S7,S8', type=str, metavar='LIST', 17 | help='training subjects separated by comma') 18 | parser.add_argument('-ste', '--subjects-test', default='S9,S11', type=str, metavar='LIST', help='test subjects separated by comma') 19 | parser.add_argument('-sun', '--subjects-unlabeled', default='', type=str, metavar='LIST', 20 | help='unlabeled subjects separated by comma for self-supervision') 21 | parser.add_argument('-a', '--actions', default='*', type=str, metavar='LIST', 22 | help='actions to train/test on, separated by comma, or * for all') 23 | parser.add_argument('-c', '--checkpoint', default='checkpoint', type=str, metavar='PATH', 24 | help='checkpoint directory') 25 | parser.add_argument('--checkpoint-frequency', default=10, type=int, metavar='N', 26 | help='create a checkpoint every N epochs') 27 | parser.add_argument('-r', '--resume', default='', type=str, metavar='FILENAME', 28 | help='checkpoint to resume (file name)') 29 | parser.add_argument('--evaluate', default='', type=str, metavar='FILENAME', help='checkpoint to evaluate (file name)') 30 | parser.add_argument('--render', action='store_true', help='visualize a particular video') 31 | parser.add_argument('--by-subject', action='store_true', help='break down error by subject (on evaluation)') 32 | parser.add_argument('--export-training-curves', action='store_true', help='save training curves as .png images') 33 | 34 | # Model arguments 35 | parser.add_argument('-s', '--stride', default=1, type=int, metavar='N', help='chunk size to use during training') 36 | parser.add_argument('-e', '--epochs', default=60, type=int, metavar='N', help='number of training epochs') 37 | parser.add_argument('-b', '--batch-size', default=1024, type=int, metavar='N', help='batch size in terms of predicted frames') 38 | parser.add_argument('-drop', '--dropout', default=0.25, type=float, metavar='P', help='dropout probability') 39 | parser.add_argument('-lr', '--learning-rate', default=0.001, type=float, metavar='LR', help='initial learning rate') 40 | parser.add_argument('-lrd', '--lr-decay', default=0.95, type=float, metavar='LR', help='learning rate decay per epoch') 41 | parser.add_argument('-no-da', '--no-data-augmentation', 
dest='data_augmentation', action='store_false', 42 | help='disable train-time flipping') 43 | parser.add_argument('-no-tta', '--no-test-time-augmentation', dest='test_time_augmentation', action='store_false', 44 | help='disable test-time flipping') 45 | parser.add_argument('-arc', '--architecture', default='3,3,3', type=str, metavar='LAYERS', help='filter widths separated by comma') 46 | parser.add_argument('--causal', action='store_true', help='use causal convolutions for real-time processing') 47 | parser.add_argument('-ch', '--channels', default=1024, type=int, metavar='N', help='number of channels in convolution layers') 48 | 49 | # Experimental 50 | parser.add_argument('--subset', default=1, type=float, metavar='FRACTION', help='reduce dataset size by fraction') 51 | parser.add_argument('--downsample', default=1, type=int, metavar='FACTOR', help='downsample frame rate by factor (semi-supervised)') 52 | parser.add_argument('--warmup', default=1, type=int, metavar='N', help='warm-up epochs for semi-supervision') 53 | parser.add_argument('--no-eval', action='store_true', help='disable epoch evaluation while training (small speed-up)') 54 | parser.add_argument('--dense', action='store_true', help='use dense convolutions instead of dilated convolutions') 55 | parser.add_argument('--disable-optimizations', action='store_true', help='disable optimized model for single-frame predictions') 56 | parser.add_argument('--linear-projection', action='store_true', help='use only linear coefficients for semi-supervised projection') 57 | parser.add_argument('--no-bone-length', action='store_false', dest='bone_length_term', 58 | help='disable bone length term in semi-supervised settings') 59 | parser.add_argument('--no-proj', action='store_true', help='disable projection for semi-supervised setting') 60 | 61 | # Visualization 62 | parser.add_argument('--viz-subject', type=str, metavar='STR', help='subject to render') 63 | parser.add_argument('--viz-action', type=str, metavar='STR', help='action to render') 64 | parser.add_argument('--viz-camera', type=int, default=0, metavar='N', help='camera to render') 65 | parser.add_argument('--viz-video', type=str, metavar='PATH', help='path to input video') 66 | parser.add_argument('--viz-skip', type=int, default=0, metavar='N', help='skip first N frames of input video') 67 | parser.add_argument('--viz-output', type=str, metavar='PATH', help='output file name (.gif or .mp4)') 68 | parser.add_argument('--viz-export', type=str, metavar='PATH', help='output file name for coordinates') 69 | parser.add_argument('--viz-bitrate', type=int, default=3000, metavar='N', help='bitrate for mp4 videos') 70 | parser.add_argument('--viz-no-ground-truth', action='store_true', help='do not show ground-truth poses') 71 | parser.add_argument('--viz-limit', type=int, default=-1, metavar='N', help='only render first N frames') 72 | parser.add_argument('--viz-downsample', type=int, default=1, metavar='N', help='downsample FPS by a factor N') 73 | parser.add_argument('--viz-size', type=int, default=5, metavar='N', help='image size') 74 | 75 | parser.set_defaults(bone_length_term=True) 76 | parser.set_defaults(data_augmentation=True) 77 | parser.set_defaults(test_time_augmentation=True) 78 | 79 | args = parser.parse_args() 80 | # Check invalid configuration 81 | if args.resume and args.evaluate: 82 | print('Invalid flags: --resume and --evaluate cannot be set at the same time') 83 | exit() 84 | 85 | if args.export_training_curves and args.no_eval: 86 | print('Invalid flags: 
--export-training-curves and --no-eval cannot be set at the same time') 87 | exit() 88 | 89 | return args -------------------------------------------------------------------------------- /common/camera.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | import numpy as np 9 | import torch 10 | 11 | from common.utils import wrap 12 | from common.quaternion import qrot, qinverse 13 | 14 | def normalize_screen_coordinates(X, w, h): 15 | assert X.shape[-1] == 2 16 | 17 | # Normalize so that [0, w] is mapped to [-1, 1], while preserving the aspect ratio 18 | return X/w*2 - [1, h/w] 19 | 20 | 21 | def image_coordinates(X, w, h): 22 | assert X.shape[-1] == 2 23 | 24 | # Reverse camera frame normalization 25 | return (X + [1, h/w])*w/2 26 | 27 | 28 | def world_to_camera(X, R, t): 29 | Rt = wrap(qinverse, R) # Invert rotation 30 | return wrap(qrot, np.tile(Rt, (*X.shape[:-1], 1)), X - t) # Rotate and translate 31 | 32 | 33 | def camera_to_world(X, R, t): 34 | return wrap(qrot, np.tile(R, (*X.shape[:-1], 1)), X) + t 35 | 36 | 37 | def project_to_2d(X, camera_params): 38 | """ 39 | Project 3D points to 2D using the Human3.6M camera projection function. 40 | This is a differentiable and batched reimplementation of the original MATLAB script. 41 | 42 | Arguments: 43 | X -- 3D points in *camera space* to transform (N, *, 3) 44 | camera_params -- intrinsic parameteres (N, 2+2+3+2=9) 45 | """ 46 | assert X.shape[-1] == 3 47 | assert len(camera_params.shape) == 2 48 | assert camera_params.shape[-1] == 9 49 | assert X.shape[0] == camera_params.shape[0] 50 | 51 | while len(camera_params.shape) < len(X.shape): 52 | camera_params = camera_params.unsqueeze(1) 53 | 54 | f = camera_params[..., :2] 55 | c = camera_params[..., 2:4] 56 | k = camera_params[..., 4:7] 57 | p = camera_params[..., 7:] 58 | 59 | XX = torch.clamp(X[..., :2] / X[..., 2:], min=-1, max=1) 60 | r2 = torch.sum(XX[..., :2]**2, dim=len(XX.shape)-1, keepdim=True) 61 | 62 | radial = 1 + torch.sum(k * torch.cat((r2, r2**2, r2**3), dim=len(r2.shape)-1), dim=len(r2.shape)-1, keepdim=True) 63 | tan = torch.sum(p*XX, dim=len(XX.shape)-1, keepdim=True) 64 | 65 | XXX = XX*(radial + tan) + p*r2 66 | 67 | return f*XXX + c 68 | 69 | def project_to_2d_linear(X, camera_params): 70 | """ 71 | Project 3D points to 2D using only linear parameters (focal length and principal point). 72 | 73 | Arguments: 74 | X -- 3D points in *camera space* to transform (N, *, 3) 75 | camera_params -- intrinsic parameteres (N, 2+2+3+2=9) 76 | """ 77 | assert X.shape[-1] == 3 78 | assert len(camera_params.shape) == 2 79 | assert camera_params.shape[-1] == 9 80 | assert X.shape[0] == camera_params.shape[0] 81 | 82 | while len(camera_params.shape) < len(X.shape): 83 | camera_params = camera_params.unsqueeze(1) 84 | 85 | f = camera_params[..., :2] 86 | c = camera_params[..., 2:4] 87 | 88 | XX = torch.clamp(X[..., :2] / X[..., 2:], min=-1, max=1) 89 | 90 | return f*XX + c -------------------------------------------------------------------------------- /common/custom_dataset.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 
3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | import numpy as np 9 | import copy 10 | from common.skeleton import Skeleton 11 | from common.mocap_dataset import MocapDataset 12 | from common.camera import normalize_screen_coordinates, image_coordinates 13 | from common.h36m_dataset import h36m_skeleton 14 | 15 | 16 | custom_camera_params = { 17 | 'id': None, 18 | 'res_w': None, # Pulled from metadata 19 | 'res_h': None, # Pulled from metadata 20 | 21 | # Dummy camera parameters (taken from Human3.6M), only for visualization purposes 22 | 'azimuth': 70, # Only used for visualization 23 | 'orientation': [0.1407056450843811, -0.1500701755285263, -0.755240797996521, 0.6223280429840088], 24 | 'translation': [1841.1070556640625, 4955.28466796875, 1563.4454345703125], 25 | } 26 | 27 | class CustomDataset(MocapDataset): 28 | def __init__(self, detections_path, remove_static_joints=True): 29 | super().__init__(fps=None, skeleton=h36m_skeleton) 30 | 31 | # Load serialized dataset 32 | data = np.load(detections_path, allow_pickle=True) 33 | resolutions = data['metadata'].item()['video_metadata'] 34 | 35 | self._cameras = {} 36 | self._data = {} 37 | for video_name, res in resolutions.items(): 38 | cam = {} 39 | cam.update(custom_camera_params) 40 | cam['orientation'] = np.array(cam['orientation'], dtype='float32') 41 | cam['translation'] = np.array(cam['translation'], dtype='float32') 42 | cam['translation'] = cam['translation']/1000 # mm to meters 43 | 44 | cam['id'] = video_name 45 | cam['res_w'] = res['w'] 46 | cam['res_h'] = res['h'] 47 | 48 | self._cameras[video_name] = [cam] 49 | 50 | self._data[video_name] = { 51 | 'custom': { 52 | 'cameras': cam 53 | } 54 | } 55 | 56 | if remove_static_joints: 57 | # Bring the skeleton to 17 joints instead of the original 32 58 | self.remove_joints([4, 5, 9, 10, 11, 16, 20, 21, 22, 23, 24, 28, 29, 30, 31]) 59 | 60 | # Rewire shoulders to the correct parents 61 | self._skeleton._parents[11] = 8 62 | self._skeleton._parents[14] = 8 63 | 64 | def supports_semi_supervised(self): 65 | return False 66 | -------------------------------------------------------------------------------- /common/generators.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | from itertools import zip_longest 9 | import numpy as np 10 | 11 | class ChunkedGenerator: 12 | """ 13 | Batched data generator, used for training. 14 | The sequences are split into equal-length chunks and padded as necessary. 
15 | 16 | Arguments: 17 | batch_size -- the batch size to use for training 18 | cameras -- list of cameras, one element for each video (optional, used for semi-supervised training) 19 | poses_3d -- list of ground-truth 3D poses, one element for each video (optional, used for supervised training) 20 | poses_2d -- list of input 2D keypoints, one element for each video 21 | chunk_length -- number of output frames to predict for each training example (usually 1) 22 | pad -- 2D input padding to compensate for valid convolutions, per side (depends on the receptive field) 23 | causal_shift -- asymmetric padding offset when causal convolutions are used (usually 0 or "pad") 24 | shuffle -- randomly shuffle the dataset before each epoch 25 | random_seed -- initial seed to use for the random generator 26 | augment -- augment the dataset by flipping poses horizontally 27 | kps_left and kps_right -- list of left/right 2D keypoints if flipping is enabled 28 | joints_left and joints_right -- list of left/right 3D joints if flipping is enabled 29 | """ 30 | def __init__(self, batch_size, cameras, poses_3d, poses_2d, 31 | chunk_length, pad=0, causal_shift=0, 32 | shuffle=True, random_seed=1234, 33 | augment=False, kps_left=None, kps_right=None, joints_left=None, joints_right=None, 34 | endless=False): 35 | assert poses_3d is None or len(poses_3d) == len(poses_2d), (len(poses_3d), len(poses_2d)) 36 | assert cameras is None or len(cameras) == len(poses_2d) 37 | 38 | # Build lineage info 39 | pairs = [] # (seq_idx, start_frame, end_frame, flip) tuples 40 | for i in range(len(poses_2d)): 41 | assert poses_3d is None or poses_3d[i].shape[0] == poses_3d[i].shape[0] 42 | n_chunks = (poses_2d[i].shape[0] + chunk_length - 1) // chunk_length 43 | offset = (n_chunks * chunk_length - poses_2d[i].shape[0]) // 2 44 | bounds = np.arange(n_chunks+1)*chunk_length - offset 45 | augment_vector = np.full(len(bounds - 1), False, dtype=bool) 46 | pairs += zip(np.repeat(i, len(bounds - 1)), bounds[:-1], bounds[1:], augment_vector) 47 | if augment: 48 | pairs += zip(np.repeat(i, len(bounds - 1)), bounds[:-1], bounds[1:], ~augment_vector) 49 | 50 | # Initialize buffers 51 | if cameras is not None: 52 | self.batch_cam = np.empty((batch_size, cameras[0].shape[-1])) 53 | if poses_3d is not None: 54 | self.batch_3d = np.empty((batch_size, chunk_length, poses_3d[0].shape[-2], poses_3d[0].shape[-1])) 55 | self.batch_2d = np.empty((batch_size, chunk_length + 2*pad, poses_2d[0].shape[-2], poses_2d[0].shape[-1])) 56 | 57 | self.num_batches = (len(pairs) + batch_size - 1) // batch_size 58 | self.batch_size = batch_size 59 | self.random = np.random.RandomState(random_seed) 60 | self.pairs = pairs 61 | self.shuffle = shuffle 62 | self.pad = pad 63 | self.causal_shift = causal_shift 64 | self.endless = endless 65 | self.state = None 66 | 67 | self.cameras = cameras 68 | self.poses_3d = poses_3d 69 | self.poses_2d = poses_2d 70 | 71 | self.augment = augment 72 | self.kps_left = kps_left 73 | self.kps_right = kps_right 74 | self.joints_left = joints_left 75 | self.joints_right = joints_right 76 | 77 | def num_frames(self): 78 | return self.num_batches * self.batch_size 79 | 80 | def random_state(self): 81 | return self.random 82 | 83 | def set_random_state(self, random): 84 | self.random = random 85 | 86 | def augment_enabled(self): 87 | return self.augment 88 | 89 | def next_pairs(self): 90 | if self.state is None: 91 | if self.shuffle: 92 | pairs = self.random.permutation(self.pairs) 93 | else: 94 | pairs = self.pairs 95 | return 0, pairs 96 | 
else: 97 | return self.state 98 | 99 | def next_epoch(self): 100 | enabled = True 101 | while enabled: 102 | start_idx, pairs = self.next_pairs() 103 | for b_i in range(start_idx, self.num_batches): 104 | chunks = pairs[b_i*self.batch_size : (b_i+1)*self.batch_size] 105 | for i, (seq_i, start_3d, end_3d, flip) in enumerate(chunks): 106 | start_2d = start_3d - self.pad - self.causal_shift 107 | end_2d = end_3d + self.pad - self.causal_shift 108 | 109 | # 2D poses 110 | seq_2d = self.poses_2d[seq_i] 111 | low_2d = max(start_2d, 0) 112 | high_2d = min(end_2d, seq_2d.shape[0]) 113 | pad_left_2d = low_2d - start_2d 114 | pad_right_2d = end_2d - high_2d 115 | if pad_left_2d != 0 or pad_right_2d != 0: 116 | self.batch_2d[i] = np.pad(seq_2d[low_2d:high_2d], ((pad_left_2d, pad_right_2d), (0, 0), (0, 0)), 'edge') 117 | else: 118 | self.batch_2d[i] = seq_2d[low_2d:high_2d] 119 | 120 | if flip: 121 | # Flip 2D keypoints 122 | self.batch_2d[i, :, :, 0] *= -1 123 | self.batch_2d[i, :, self.kps_left + self.kps_right] = self.batch_2d[i, :, self.kps_right + self.kps_left] 124 | 125 | # 3D poses 126 | if self.poses_3d is not None: 127 | seq_3d = self.poses_3d[seq_i] 128 | low_3d = max(start_3d, 0) 129 | high_3d = min(end_3d, seq_3d.shape[0]) 130 | pad_left_3d = low_3d - start_3d 131 | pad_right_3d = end_3d - high_3d 132 | if pad_left_3d != 0 or pad_right_3d != 0: 133 | self.batch_3d[i] = np.pad(seq_3d[low_3d:high_3d], ((pad_left_3d, pad_right_3d), (0, 0), (0, 0)), 'edge') 134 | else: 135 | self.batch_3d[i] = seq_3d[low_3d:high_3d] 136 | 137 | if flip: 138 | # Flip 3D joints 139 | self.batch_3d[i, :, :, 0] *= -1 140 | self.batch_3d[i, :, self.joints_left + self.joints_right] = \ 141 | self.batch_3d[i, :, self.joints_right + self.joints_left] 142 | 143 | # Cameras 144 | if self.cameras is not None: 145 | self.batch_cam[i] = self.cameras[seq_i] 146 | if flip: 147 | # Flip horizontal distortion coefficients 148 | self.batch_cam[i, 2] *= -1 149 | self.batch_cam[i, 7] *= -1 150 | 151 | if self.endless: 152 | self.state = (b_i + 1, pairs) 153 | if self.poses_3d is None and self.cameras is None: 154 | yield None, None, self.batch_2d[:len(chunks)] 155 | elif self.poses_3d is not None and self.cameras is None: 156 | yield None, self.batch_3d[:len(chunks)], self.batch_2d[:len(chunks)] 157 | elif self.poses_3d is None: 158 | yield self.batch_cam[:len(chunks)], None, self.batch_2d[:len(chunks)] 159 | else: 160 | yield self.batch_cam[:len(chunks)], self.batch_3d[:len(chunks)], self.batch_2d[:len(chunks)] 161 | 162 | if self.endless: 163 | self.state = None 164 | else: 165 | enabled = False 166 | 167 | 168 | class UnchunkedGenerator: 169 | """ 170 | Non-batched data generator, used for testing. 171 | Sequences are returned one at a time (i.e. batch size = 1), without chunking. 172 | 173 | If data augmentation is enabled, the batches contain two sequences (i.e. batch size = 2), 174 | the second of which is a mirrored version of the first. 
175 | 176 | Arguments: 177 | cameras -- list of cameras, one element for each video (optional, used for semi-supervised training) 178 | poses_3d -- list of ground-truth 3D poses, one element for each video (optional, used for supervised training) 179 | poses_2d -- list of input 2D keypoints, one element for each video 180 | pad -- 2D input padding to compensate for valid convolutions, per side (depends on the receptive field) 181 | causal_shift -- asymmetric padding offset when causal convolutions are used (usually 0 or "pad") 182 | augment -- augment the dataset by flipping poses horizontally 183 | kps_left and kps_right -- list of left/right 2D keypoints if flipping is enabled 184 | joints_left and joints_right -- list of left/right 3D joints if flipping is enabled 185 | """ 186 | 187 | def __init__(self, cameras, poses_3d, poses_2d, pad=0, causal_shift=0, 188 | augment=False, kps_left=None, kps_right=None, joints_left=None, joints_right=None): 189 | assert poses_3d is None or len(poses_3d) == len(poses_2d) 190 | assert cameras is None or len(cameras) == len(poses_2d) 191 | 192 | self.augment = augment 193 | self.kps_left = kps_left 194 | self.kps_right = kps_right 195 | self.joints_left = joints_left 196 | self.joints_right = joints_right 197 | 198 | self.pad = pad 199 | self.causal_shift = causal_shift 200 | self.cameras = [] if cameras is None else cameras 201 | self.poses_3d = [] if poses_3d is None else poses_3d 202 | self.poses_2d = poses_2d 203 | 204 | def num_frames(self): 205 | count = 0 206 | for p in self.poses_2d: 207 | count += p.shape[0] 208 | return count 209 | 210 | def augment_enabled(self): 211 | return self.augment 212 | 213 | def set_augment(self, augment): 214 | self.augment = augment 215 | 216 | def next_epoch(self): 217 | for seq_cam, seq_3d, seq_2d in zip_longest(self.cameras, self.poses_3d, self.poses_2d): 218 | batch_cam = None if seq_cam is None else np.expand_dims(seq_cam, axis=0) 219 | batch_3d = None if seq_3d is None else np.expand_dims(seq_3d, axis=0) 220 | batch_2d = np.expand_dims(np.pad(seq_2d, 221 | ((self.pad + self.causal_shift, self.pad - self.causal_shift), (0, 0), (0, 0)), 222 | 'edge'), axis=0) 223 | if self.augment: 224 | # Append flipped version 225 | if batch_cam is not None: 226 | batch_cam = np.concatenate((batch_cam, batch_cam), axis=0) 227 | batch_cam[1, 2] *= -1 228 | batch_cam[1, 7] *= -1 229 | 230 | if batch_3d is not None: 231 | batch_3d = np.concatenate((batch_3d, batch_3d), axis=0) 232 | batch_3d[1, :, :, 0] *= -1 233 | batch_3d[1, :, self.joints_left + self.joints_right] = batch_3d[1, :, self.joints_right + self.joints_left] 234 | 235 | batch_2d = np.concatenate((batch_2d, batch_2d), axis=0) 236 | batch_2d[1, :, :, 0] *= -1 237 | batch_2d[1, :, self.kps_left + self.kps_right] = batch_2d[1, :, self.kps_right + self.kps_left] 238 | 239 | yield batch_cam, batch_3d, batch_2d -------------------------------------------------------------------------------- /common/h36m_dataset.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 
6 | # 7 | 8 | import numpy as np 9 | import copy 10 | from common.skeleton import Skeleton 11 | from common.mocap_dataset import MocapDataset 12 | from common.camera import normalize_screen_coordinates, image_coordinates 13 | 14 | h36m_skeleton = Skeleton(parents=[-1, 0, 1, 2, 3, 4, 0, 6, 7, 8, 9, 0, 11, 12, 13, 14, 12, 15 | 16, 17, 18, 19, 20, 19, 22, 12, 24, 25, 26, 27, 28, 27, 30], 16 | joints_left=[6, 7, 8, 9, 10, 16, 17, 18, 19, 20, 21, 22, 23], 17 | joints_right=[1, 2, 3, 4, 5, 24, 25, 26, 27, 28, 29, 30, 31]) 18 | 19 | h36m_cameras_intrinsic_params = [ 20 | { 21 | 'id': '54138969', 22 | 'center': [512.54150390625, 515.4514770507812], 23 | 'focal_length': [1145.0494384765625, 1143.7811279296875], 24 | 'radial_distortion': [-0.20709891617298126, 0.24777518212795258, -0.0030751503072679043], 25 | 'tangential_distortion': [-0.0009756988729350269, -0.00142447161488235], 26 | 'res_w': 1000, 27 | 'res_h': 1002, 28 | 'azimuth': 70, # Only used for visualization 29 | }, 30 | { 31 | 'id': '55011271', 32 | 'center': [508.8486328125, 508.0649108886719], 33 | 'focal_length': [1149.6756591796875, 1147.5916748046875], 34 | 'radial_distortion': [-0.1942136287689209, 0.2404085397720337, 0.006819975562393665], 35 | 'tangential_distortion': [-0.0016190266469493508, -0.0027408944442868233], 36 | 'res_w': 1000, 37 | 'res_h': 1000, 38 | 'azimuth': -70, # Only used for visualization 39 | }, 40 | { 41 | 'id': '58860488', 42 | 'center': [519.8158569335938, 501.40264892578125], 43 | 'focal_length': [1149.1407470703125, 1148.7989501953125], 44 | 'radial_distortion': [-0.2083381861448288, 0.25548800826072693, -0.0024604974314570427], 45 | 'tangential_distortion': [0.0014843869721516967, -0.0007599993259645998], 46 | 'res_w': 1000, 47 | 'res_h': 1000, 48 | 'azimuth': 110, # Only used for visualization 49 | }, 50 | { 51 | 'id': '60457274', 52 | 'center': [514.9682006835938, 501.88201904296875], 53 | 'focal_length': [1145.5113525390625, 1144.77392578125], 54 | 'radial_distortion': [-0.198384091258049, 0.21832367777824402, -0.008947807364165783], 55 | 'tangential_distortion': [-0.0005872055771760643, -0.0018133620033040643], 56 | 'res_w': 1000, 57 | 'res_h': 1002, 58 | 'azimuth': -110, # Only used for visualization 59 | }, 60 | ] 61 | 62 | h36m_cameras_extrinsic_params = { 63 | 'S1': [ 64 | { 65 | 'orientation': [0.1407056450843811, -0.1500701755285263, -0.755240797996521, 0.6223280429840088], 66 | 'translation': [1841.1070556640625, 4955.28466796875, 1563.4454345703125], 67 | }, 68 | { 69 | 'orientation': [0.6157187819480896, -0.764836311340332, -0.14833825826644897, 0.11794740706682205], 70 | 'translation': [1761.278564453125, -5078.0068359375, 1606.2650146484375], 71 | }, 72 | { 73 | 'orientation': [0.14651472866535187, -0.14647851884365082, 0.7653023600578308, -0.6094175577163696], 74 | 'translation': [-1846.7777099609375, 5215.04638671875, 1491.972412109375], 75 | }, 76 | { 77 | 'orientation': [0.5834008455276489, -0.7853162288665771, 0.14548823237419128, -0.14749594032764435], 78 | 'translation': [-1794.7896728515625, -3722.698974609375, 1574.8927001953125], 79 | }, 80 | ], 81 | 'S2': [ 82 | {}, 83 | {}, 84 | {}, 85 | {}, 86 | ], 87 | 'S3': [ 88 | {}, 89 | {}, 90 | {}, 91 | {}, 92 | ], 93 | 'S4': [ 94 | {}, 95 | {}, 96 | {}, 97 | {}, 98 | ], 99 | 'S5': [ 100 | { 101 | 'orientation': [0.1467377245426178, -0.162370964884758, -0.7551892995834351, 0.6178938746452332], 102 | 'translation': [2097.3916015625, 4880.94482421875, 1605.732421875], 103 | }, 104 | { 105 | 'orientation': [0.6159758567810059, 
-0.7626792192459106, -0.15728192031383514, 0.1189815029501915], 106 | 'translation': [2031.7008056640625, -5167.93310546875, 1612.923095703125], 107 | }, 108 | { 109 | 'orientation': [0.14291371405124664, -0.12907841801643372, 0.7678384780883789, -0.6110143065452576], 110 | 'translation': [-1620.5948486328125, 5171.65869140625, 1496.43701171875], 111 | }, 112 | { 113 | 'orientation': [0.5920479893684387, -0.7814217805862427, 0.1274748593568802, -0.15036417543888092], 114 | 'translation': [-1637.1737060546875, -3867.3173828125, 1547.033203125], 115 | }, 116 | ], 117 | 'S6': [ 118 | { 119 | 'orientation': [0.1337897777557373, -0.15692396461963654, -0.7571090459823608, 0.6198879480361938], 120 | 'translation': [1935.4517822265625, 4950.24560546875, 1618.0838623046875], 121 | }, 122 | { 123 | 'orientation': [0.6147197484970093, -0.7628812789916992, -0.16174767911434174, 0.11819244921207428], 124 | 'translation': [1969.803955078125, -5128.73876953125, 1632.77880859375], 125 | }, 126 | { 127 | 'orientation': [0.1529948115348816, -0.13529130816459656, 0.7646096348762512, -0.6112781167030334], 128 | 'translation': [-1769.596435546875, 5185.361328125, 1476.993408203125], 129 | }, 130 | { 131 | 'orientation': [0.5916101336479187, -0.7804774045944214, 0.12832270562648773, -0.1561593860387802], 132 | 'translation': [-1721.668701171875, -3884.13134765625, 1540.4879150390625], 133 | }, 134 | ], 135 | 'S7': [ 136 | { 137 | 'orientation': [0.1435241848230362, -0.1631336808204651, -0.7548328638076782, 0.6188824772834778], 138 | 'translation': [1974.512939453125, 4926.3544921875, 1597.8326416015625], 139 | }, 140 | { 141 | 'orientation': [0.6141672730445862, -0.7638262510299683, -0.1596645563840866, 0.1177929937839508], 142 | 'translation': [1937.0584716796875, -5119.7900390625, 1631.5665283203125], 143 | }, 144 | { 145 | 'orientation': [0.14550060033798218, -0.12874816358089447, 0.7660516500473022, -0.6127139329910278], 146 | 'translation': [-1741.8111572265625, 5208.24951171875, 1464.8245849609375], 147 | }, 148 | { 149 | 'orientation': [0.5912848114967346, -0.7821764349937439, 0.12445473670959473, -0.15196487307548523], 150 | 'translation': [-1734.7105712890625, -3832.42138671875, 1548.5830078125], 151 | }, 152 | ], 153 | 'S8': [ 154 | { 155 | 'orientation': [0.14110587537288666, -0.15589867532253265, -0.7561917304992676, 0.619644045829773], 156 | 'translation': [2150.65185546875, 4896.1611328125, 1611.9046630859375], 157 | }, 158 | { 159 | 'orientation': [0.6169601678848267, -0.7647668123245239, -0.14846350252628326, 0.11158157885074615], 160 | 'translation': [2219.965576171875, -5148.453125, 1613.0440673828125], 161 | }, 162 | { 163 | 'orientation': [0.1471444070339203, -0.13377119600772858, 0.7670128345489502, -0.6100369691848755], 164 | 'translation': [-1571.2215576171875, 5137.0185546875, 1498.1761474609375], 165 | }, 166 | { 167 | 'orientation': [0.5927824378013611, -0.7825870513916016, 0.12147816270589828, -0.14631995558738708], 168 | 'translation': [-1476.913330078125, -3896.7412109375, 1547.97216796875], 169 | }, 170 | ], 171 | 'S9': [ 172 | { 173 | 'orientation': [0.15540587902069092, -0.15548215806484222, -0.7532095313072205, 0.6199594736099243], 174 | 'translation': [2044.45849609375, 4935.1171875, 1481.2275390625], 175 | }, 176 | { 177 | 'orientation': [0.618784487247467, -0.7634735107421875, -0.14132238924503326, 0.11933968216180801], 178 | 'translation': [1990.959716796875, -5123.810546875, 1568.8048095703125], 179 | }, 180 | { 181 | 'orientation': [0.13357827067375183, 
-0.1367100477218628, 0.7689454555511475, -0.6100738644599915], 182 | 'translation': [-1670.9921875, 5211.98583984375, 1528.387939453125], 183 | }, 184 | { 185 | 'orientation': [0.5879399180412292, -0.7823407053947449, 0.1427614390850067, -0.14794869720935822], 186 | 'translation': [-1696.04345703125, -3827.099853515625, 1591.4127197265625], 187 | }, 188 | ], 189 | 'S11': [ 190 | { 191 | 'orientation': [0.15232472121715546, -0.15442320704460144, -0.7547563314437866, 0.6191070079803467], 192 | 'translation': [2098.440185546875, 4926.5546875, 1500.278564453125], 193 | }, 194 | { 195 | 'orientation': [0.6189449429512024, -0.7600917220115662, -0.15300633013248444, 0.1255258321762085], 196 | 'translation': [2083.182373046875, -4912.1728515625, 1561.07861328125], 197 | }, 198 | { 199 | 'orientation': [0.14943228662014008, -0.15650227665901184, 0.7681233882904053, -0.6026304364204407], 200 | 'translation': [-1609.8153076171875, 5177.3359375, 1537.896728515625], 201 | }, 202 | { 203 | 'orientation': [0.5894251465797424, -0.7818877100944519, 0.13991211354732513, -0.14715361595153809], 204 | 'translation': [-1590.738037109375, -3854.1689453125, 1578.017578125], 205 | }, 206 | ], 207 | } 208 | 209 | class Human36mDataset(MocapDataset): 210 | def __init__(self, path, remove_static_joints=True): 211 | super().__init__(fps=50, skeleton=h36m_skeleton) 212 | 213 | self._cameras = copy.deepcopy(h36m_cameras_extrinsic_params) 214 | for cameras in self._cameras.values(): 215 | for i, cam in enumerate(cameras): 216 | cam.update(h36m_cameras_intrinsic_params[i]) 217 | for k, v in cam.items(): 218 | if k not in ['id', 'res_w', 'res_h']: 219 | cam[k] = np.array(v, dtype='float32') 220 | 221 | # Normalize camera frame 222 | cam['center'] = normalize_screen_coordinates(cam['center'], w=cam['res_w'], h=cam['res_h']).astype('float32') 223 | cam['focal_length'] = cam['focal_length']/cam['res_w']*2 224 | if 'translation' in cam: 225 | cam['translation'] = cam['translation']/1000 # mm to meters 226 | 227 | # Add intrinsic parameters vector 228 | cam['intrinsic'] = np.concatenate((cam['focal_length'], 229 | cam['center'], 230 | cam['radial_distortion'], 231 | cam['tangential_distortion'])) 232 | 233 | # Load serialized dataset 234 | data = np.load(path, allow_pickle=True)['positions_3d'].item() 235 | 236 | self._data = {} 237 | for subject, actions in data.items(): 238 | self._data[subject] = {} 239 | for action_name, positions in actions.items(): 240 | self._data[subject][action_name] = { 241 | 'positions': positions, 242 | 'cameras': self._cameras[subject], 243 | } 244 | 245 | if remove_static_joints: 246 | # Bring the skeleton to 17 joints instead of the original 32 247 | self.remove_joints([4, 5, 9, 10, 11, 16, 20, 21, 22, 23, 24, 28, 29, 30, 31]) 248 | 249 | # Rewire shoulders to the correct parents 250 | self._skeleton._parents[11] = 8 251 | self._skeleton._parents[14] = 8 252 | 253 | def supports_semi_supervised(self): 254 | return True 255 | -------------------------------------------------------------------------------- /common/humaneva_dataset.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 
6 | # 7 | 8 | import numpy as np 9 | import copy 10 | from common.skeleton import Skeleton 11 | from common.mocap_dataset import MocapDataset 12 | from common.camera import normalize_screen_coordinates, image_coordinates 13 | 14 | humaneva_skeleton = Skeleton(parents=[-1, 0, 1, 2, 3, 1, 5, 6, 0, 8, 9, 0, 11, 12, 1], 15 | joints_left=[2, 3, 4, 8, 9, 10], 16 | joints_right=[5, 6, 7, 11, 12, 13]) 17 | 18 | humaneva_cameras_intrinsic_params = [ 19 | { 20 | 'id': 'C1', 21 | 'res_w': 640, 22 | 'res_h': 480, 23 | 'azimuth': 0, # Only used for visualization 24 | }, 25 | { 26 | 'id': 'C2', 27 | 'res_w': 640, 28 | 'res_h': 480, 29 | 'azimuth': -90, # Only used for visualization 30 | }, 31 | { 32 | 'id': 'C3', 33 | 'res_w': 640, 34 | 'res_h': 480, 35 | 'azimuth': 90, # Only used for visualization 36 | }, 37 | ] 38 | 39 | humaneva_cameras_extrinsic_params = { 40 | 'S1': [ 41 | { 42 | 'orientation': [0.424207, -0.4983646, -0.5802981, 0.4847012], 43 | 'translation': [4062.227, 663.2477, 1528.397], 44 | }, 45 | { 46 | 'orientation': [0.6503354, -0.7481602, -0.0919284, 0.0941766], 47 | 'translation': [844.8131, -3805.2092, 1504.9929], 48 | }, 49 | { 50 | 'orientation': [0.0664734, -0.0690535, 0.7416416, -0.6639132], 51 | 'translation': [-797.67377, 3916.3174, 1433.6602], 52 | }, 53 | ], 54 | 'S2': [ 55 | { 56 | 'orientation': [ 0.4214752, -0.4961493, -0.5838273, 0.4851187 ], 57 | 'translation': [ 4112.9121, 626.4929, 1545.2988], 58 | }, 59 | { 60 | 'orientation': [ 0.6501393, -0.7476588, -0.0954617, 0.0959808 ], 61 | 'translation': [ 923.5740, -3877.9243, 1504.5518], 62 | }, 63 | { 64 | 'orientation': [ 0.0699353, -0.0712403, 0.7421637, -0.662742 ], 65 | 'translation': [ -781.4915, 3838.8853, 1444.9929], 66 | }, 67 | ], 68 | 'S3': [ 69 | { 70 | 'orientation': [ 0.424207, -0.4983646, -0.5802981, 0.4847012 ], 71 | 'translation': [ 4062.2271, 663.2477, 1528.3970], 72 | }, 73 | { 74 | 'orientation': [ 0.6503354, -0.7481602, -0.0919284, 0.0941766 ], 75 | 'translation': [ 844.8131, -3805.2092, 1504.9929], 76 | }, 77 | { 78 | 'orientation': [ 0.0664734, -0.0690535, 0.7416416, -0.6639132 ], 79 | 'translation': [ -797.6738, 3916.3174, 1433.6602], 80 | }, 81 | ], 82 | 'S4': [ 83 | {}, 84 | {}, 85 | {}, 86 | ], 87 | 88 | } 89 | 90 | class HumanEvaDataset(MocapDataset): 91 | def __init__(self, path): 92 | super().__init__(fps=60, skeleton=humaneva_skeleton) 93 | 94 | self._cameras = copy.deepcopy(humaneva_cameras_extrinsic_params) 95 | for cameras in self._cameras.values(): 96 | for i, cam in enumerate(cameras): 97 | cam.update(humaneva_cameras_intrinsic_params[i]) 98 | for k, v in cam.items(): 99 | if k not in ['id', 'res_w', 'res_h']: 100 | cam[k] = np.array(v, dtype='float32') 101 | if 'translation' in cam: 102 | cam['translation'] = cam['translation']/1000 # mm to meters 103 | 104 | for subject in list(self._cameras.keys()): 105 | data = self._cameras[subject] 106 | del self._cameras[subject] 107 | for prefix in ['Train/', 'Validate/', 'Unlabeled/Train/', 'Unlabeled/Validate/', 'Unlabeled/']: 108 | self._cameras[prefix + subject] = data 109 | 110 | # Load serialized dataset 111 | data = np.load(path, allow_pickle=True)['positions_3d'].item() 112 | 113 | self._data = {} 114 | for subject, actions in data.items(): 115 | self._data[subject] = {} 116 | for action_name, positions in actions.items(): 117 | self._data[subject][action_name] = { 118 | 'positions': positions, 119 | 'cameras': self._cameras[subject], 120 | } 121 | -------------------------------------------------------------------------------- 
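The two dataset classes above (`Human36mDataset` and `HumanEvaDataset`) share the `MocapDataset` interface, so the generators and `run.py` can treat them uniformly. Below is a minimal sketch of how such a dataset is typically instantiated, assuming the preprocessed archive `data/data_3d_h36m.npz` from `DATASETS.md` is already in place; the action name in the last line is purely illustrative.

```python
from common.h36m_dataset import Human36mDataset

# Assumes data/data_3d_h36m.npz was generated as described in DATASETS.md
dataset = Human36mDataset('data/data_3d_h36m.npz')

print(dataset.fps())                    # 50
print(dataset.skeleton().num_joints())  # 17, after the static joints are removed
print(sorted(dataset.subjects()))       # ['S1', 'S11', 'S5', 'S6', 'S7', 'S8', 'S9']
print(dataset['S1']['Walking 1']['positions'].shape)  # (num_frames, 17, 3); action name is hypothetical
```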
/common/loss.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | import torch 9 | import numpy as np 10 | 11 | def mpjpe(predicted, target): 12 | """ 13 | Mean per-joint position error (i.e. mean Euclidean distance), 14 | often referred to as "Protocol #1" in many papers. 15 | """ 16 | assert predicted.shape == target.shape 17 | return torch.mean(torch.norm(predicted - target, dim=len(target.shape)-1)) 18 | 19 | def weighted_mpjpe(predicted, target, w): 20 | """ 21 | Weighted mean per-joint position error (i.e. mean Euclidean distance) 22 | """ 23 | assert predicted.shape == target.shape 24 | assert w.shape[0] == predicted.shape[0] 25 | return torch.mean(w * torch.norm(predicted - target, dim=len(target.shape)-1)) 26 | 27 | def p_mpjpe(predicted, target): 28 | """ 29 | Pose error: MPJPE after rigid alignment (scale, rotation, and translation), 30 | often referred to as "Protocol #2" in many papers. 31 | """ 32 | assert predicted.shape == target.shape 33 | 34 | muX = np.mean(target, axis=1, keepdims=True) 35 | muY = np.mean(predicted, axis=1, keepdims=True) 36 | 37 | X0 = target - muX 38 | Y0 = predicted - muY 39 | 40 | normX = np.sqrt(np.sum(X0**2, axis=(1, 2), keepdims=True)) 41 | normY = np.sqrt(np.sum(Y0**2, axis=(1, 2), keepdims=True)) 42 | 43 | X0 /= normX 44 | Y0 /= normY 45 | 46 | H = np.matmul(X0.transpose(0, 2, 1), Y0) 47 | U, s, Vt = np.linalg.svd(H) 48 | V = Vt.transpose(0, 2, 1) 49 | R = np.matmul(V, U.transpose(0, 2, 1)) 50 | 51 | # Avoid improper rotations (reflections), i.e. rotations with det(R) = -1 52 | sign_detR = np.sign(np.expand_dims(np.linalg.det(R), axis=1)) 53 | V[:, :, -1] *= sign_detR 54 | s[:, -1] *= sign_detR.flatten() 55 | R = np.matmul(V, U.transpose(0, 2, 1)) # Rotation 56 | 57 | tr = np.expand_dims(np.sum(s, axis=1, keepdims=True), axis=2) 58 | 59 | a = tr * normX / normY # Scale 60 | t = muX - a*np.matmul(muY, R) # Translation 61 | 62 | # Perform rigid transformation on the input 63 | predicted_aligned = a*np.matmul(predicted, R) + t 64 | 65 | # Return MPJPE 66 | return np.mean(np.linalg.norm(predicted_aligned - target, axis=len(target.shape)-1)) 67 | 68 | def n_mpjpe(predicted, target): 69 | """ 70 | Normalized MPJPE (scale only), adapted from: 71 | https://github.com/hrhodin/UnsupervisedGeometryAwareRepresentationLearning/blob/master/losses/poses.py 72 | """ 73 | assert predicted.shape == target.shape 74 | 75 | norm_predicted = torch.mean(torch.sum(predicted**2, dim=3, keepdim=True), dim=2, keepdim=True) 76 | norm_target = torch.mean(torch.sum(target*predicted, dim=3, keepdim=True), dim=2, keepdim=True) 77 | scale = norm_target / norm_predicted 78 | return mpjpe(scale * predicted, target) 79 | 80 | def mean_velocity_error(predicted, target): 81 | """ 82 | Mean per-joint velocity error (i.e. 
mean Euclidean distance of the 1st derivative) 83 | """ 84 | assert predicted.shape == target.shape 85 | 86 | velocity_predicted = np.diff(predicted, axis=0) 87 | velocity_target = np.diff(target, axis=0) 88 | 89 | return np.mean(np.linalg.norm(velocity_predicted - velocity_target, axis=len(target.shape)-1)) -------------------------------------------------------------------------------- /common/mocap_dataset.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | import numpy as np 9 | from common.skeleton import Skeleton 10 | 11 | class MocapDataset: 12 | def __init__(self, fps, skeleton): 13 | self._skeleton = skeleton 14 | self._fps = fps 15 | self._data = None # Must be filled by subclass 16 | self._cameras = None # Must be filled by subclass 17 | 18 | def remove_joints(self, joints_to_remove): 19 | kept_joints = self._skeleton.remove_joints(joints_to_remove) 20 | for subject in self._data.keys(): 21 | for action in self._data[subject].keys(): 22 | s = self._data[subject][action] 23 | if 'positions' in s: 24 | s['positions'] = s['positions'][:, kept_joints] 25 | 26 | 27 | def __getitem__(self, key): 28 | return self._data[key] 29 | 30 | def subjects(self): 31 | return self._data.keys() 32 | 33 | def fps(self): 34 | return self._fps 35 | 36 | def skeleton(self): 37 | return self._skeleton 38 | 39 | def cameras(self): 40 | return self._cameras 41 | 42 | def supports_semi_supervised(self): 43 | # This method can be overridden 44 | return False -------------------------------------------------------------------------------- /common/model.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | import torch.nn as nn 9 | 10 | class TemporalModelBase(nn.Module): 11 | """ 12 | Do not instantiate this class. 13 | """ 14 | 15 | def __init__(self, num_joints_in, in_features, num_joints_out, 16 | filter_widths, causal, dropout, channels): 17 | super().__init__() 18 | 19 | # Validate input 20 | for fw in filter_widths: 21 | assert fw % 2 != 0, 'Only odd filter widths are supported' 22 | 23 | self.num_joints_in = num_joints_in 24 | self.in_features = in_features 25 | self.num_joints_out = num_joints_out 26 | self.filter_widths = filter_widths 27 | 28 | self.drop = nn.Dropout(dropout) 29 | self.relu = nn.ReLU(inplace=True) 30 | 31 | self.pad = [ filter_widths[0] // 2 ] 32 | self.expand_bn = nn.BatchNorm1d(channels, momentum=0.1) 33 | self.shrink = nn.Conv1d(channels, num_joints_out*3, 1) 34 | 35 | 36 | def set_bn_momentum(self, momentum): 37 | self.expand_bn.momentum = momentum 38 | for bn in self.layers_bn: 39 | bn.momentum = momentum 40 | 41 | def receptive_field(self): 42 | """ 43 | Return the total receptive field of this model as # of frames. 44 | """ 45 | frames = 0 46 | for f in self.pad: 47 | frames += f 48 | return 1 + 2*frames 49 | 50 | def total_causal_shift(self): 51 | """ 52 | Return the asymmetric offset for sequence padding. 53 | The returned value is typically 0 if causal convolutions are disabled, 54 | otherwise it is half the receptive field. 
55 | """ 56 | frames = self.causal_shift[0] 57 | next_dilation = self.filter_widths[0] 58 | for i in range(1, len(self.filter_widths)): 59 | frames += self.causal_shift[i] * next_dilation 60 | next_dilation *= self.filter_widths[i] 61 | return frames 62 | 63 | def forward(self, x): 64 | assert len(x.shape) == 4 65 | assert x.shape[-2] == self.num_joints_in 66 | assert x.shape[-1] == self.in_features 67 | 68 | sz = x.shape[:3] 69 | x = x.view(x.shape[0], x.shape[1], -1) 70 | x = x.permute(0, 2, 1) 71 | 72 | x = self._forward_blocks(x) 73 | 74 | x = x.permute(0, 2, 1) 75 | x = x.view(sz[0], -1, self.num_joints_out, 3) 76 | 77 | return x 78 | 79 | class TemporalModel(TemporalModelBase): 80 | """ 81 | Reference 3D pose estimation model with temporal convolutions. 82 | This implementation can be used for all use-cases. 83 | """ 84 | 85 | def __init__(self, num_joints_in, in_features, num_joints_out, 86 | filter_widths, causal=False, dropout=0.25, channels=1024, dense=False): 87 | """ 88 | Initialize this model. 89 | 90 | Arguments: 91 | num_joints_in -- number of input joints (e.g. 17 for Human3.6M) 92 | in_features -- number of input features for each joint (typically 2 for 2D input) 93 | num_joints_out -- number of output joints (can be different than input) 94 | filter_widths -- list of convolution widths, which also determines the # of blocks and receptive field 95 | causal -- use causal convolutions instead of symmetric convolutions (for real-time applications) 96 | dropout -- dropout probability 97 | channels -- number of convolution channels 98 | dense -- use regular dense convolutions instead of dilated convolutions (ablation experiment) 99 | """ 100 | super().__init__(num_joints_in, in_features, num_joints_out, filter_widths, causal, dropout, channels) 101 | 102 | self.expand_conv = nn.Conv1d(num_joints_in*in_features, channels, filter_widths[0], bias=False) 103 | 104 | layers_conv = [] 105 | layers_bn = [] 106 | 107 | self.causal_shift = [ (filter_widths[0]) // 2 if causal else 0 ] 108 | next_dilation = filter_widths[0] 109 | for i in range(1, len(filter_widths)): 110 | self.pad.append((filter_widths[i] - 1)*next_dilation // 2) 111 | self.causal_shift.append((filter_widths[i]//2 * next_dilation) if causal else 0) 112 | 113 | layers_conv.append(nn.Conv1d(channels, channels, 114 | filter_widths[i] if not dense else (2*self.pad[-1] + 1), 115 | dilation=next_dilation if not dense else 1, 116 | bias=False)) 117 | layers_bn.append(nn.BatchNorm1d(channels, momentum=0.1)) 118 | layers_conv.append(nn.Conv1d(channels, channels, 1, dilation=1, bias=False)) 119 | layers_bn.append(nn.BatchNorm1d(channels, momentum=0.1)) 120 | 121 | next_dilation *= filter_widths[i] 122 | 123 | self.layers_conv = nn.ModuleList(layers_conv) 124 | self.layers_bn = nn.ModuleList(layers_bn) 125 | 126 | def _forward_blocks(self, x): 127 | x = self.drop(self.relu(self.expand_bn(self.expand_conv(x)))) 128 | 129 | for i in range(len(self.pad) - 1): 130 | pad = self.pad[i+1] 131 | shift = self.causal_shift[i+1] 132 | res = x[:, :, pad + shift : x.shape[2] - pad + shift] 133 | 134 | x = self.drop(self.relu(self.layers_bn[2*i](self.layers_conv[2*i](x)))) 135 | x = res + self.drop(self.relu(self.layers_bn[2*i + 1](self.layers_conv[2*i + 1](x)))) 136 | 137 | x = self.shrink(x) 138 | return x 139 | 140 | class TemporalModelOptimized1f(TemporalModelBase): 141 | """ 142 | 3D pose estimation model optimized for single-frame batching, i.e. 143 | where batches have input length = receptive field, and output length = 1. 
144 | This scenario is only used for training when stride == 1. 145 | 146 | This implementation replaces dilated convolutions with strided convolutions 147 | to avoid generating unused intermediate results. The weights are interchangeable 148 | with the reference implementation. 149 | """ 150 | 151 | def __init__(self, num_joints_in, in_features, num_joints_out, 152 | filter_widths, causal=False, dropout=0.25, channels=1024): 153 | """ 154 | Initialize this model. 155 | 156 | Arguments: 157 | num_joints_in -- number of input joints (e.g. 17 for Human3.6M) 158 | in_features -- number of input features for each joint (typically 2 for 2D input) 159 | num_joints_out -- number of output joints (can be different than input) 160 | filter_widths -- list of convolution widths, which also determines the # of blocks and receptive field 161 | causal -- use causal convolutions instead of symmetric convolutions (for real-time applications) 162 | dropout -- dropout probability 163 | channels -- number of convolution channels 164 | """ 165 | super().__init__(num_joints_in, in_features, num_joints_out, filter_widths, causal, dropout, channels) 166 | 167 | self.expand_conv = nn.Conv1d(num_joints_in*in_features, channels, filter_widths[0], stride=filter_widths[0], bias=False) 168 | 169 | layers_conv = [] 170 | layers_bn = [] 171 | 172 | self.causal_shift = [ (filter_widths[0] // 2) if causal else 0 ] 173 | next_dilation = filter_widths[0] 174 | for i in range(1, len(filter_widths)): 175 | self.pad.append((filter_widths[i] - 1)*next_dilation // 2) 176 | self.causal_shift.append((filter_widths[i]//2) if causal else 0) 177 | 178 | layers_conv.append(nn.Conv1d(channels, channels, filter_widths[i], stride=filter_widths[i], bias=False)) 179 | layers_bn.append(nn.BatchNorm1d(channels, momentum=0.1)) 180 | layers_conv.append(nn.Conv1d(channels, channels, 1, dilation=1, bias=False)) 181 | layers_bn.append(nn.BatchNorm1d(channels, momentum=0.1)) 182 | next_dilation *= filter_widths[i] 183 | 184 | self.layers_conv = nn.ModuleList(layers_conv) 185 | self.layers_bn = nn.ModuleList(layers_bn) 186 | 187 | def _forward_blocks(self, x): 188 | x = self.drop(self.relu(self.expand_bn(self.expand_conv(x)))) 189 | 190 | for i in range(len(self.pad) - 1): 191 | res = x[:, :, self.causal_shift[i+1] + self.filter_widths[i+1]//2 :: self.filter_widths[i+1]] 192 | 193 | x = self.drop(self.relu(self.layers_bn[2*i](self.layers_conv[2*i](x)))) 194 | x = res + self.drop(self.relu(self.layers_bn[2*i + 1](self.layers_conv[2*i + 1](x)))) 195 | 196 | x = self.shrink(x) 197 | return x -------------------------------------------------------------------------------- /common/quaternion.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | import torch 9 | 10 | def qrot(q, v): 11 | """ 12 | Rotate vector(s) v about the rotation described by quaternion(s) q. 13 | Expects a tensor of shape (*, 4) for q and a tensor of shape (*, 3) for v, 14 | where * denotes any number of dimensions. 15 | Returns a tensor of shape (*, 3). 
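For example, rotating v = (1, 0, 0) by q = (0.7071, 0, 0, 0.7071), i.e. a 90-degree rotation about the z-axis with q in (w, x, y, z) order, yields approximately (0, 1, 0).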
16 | """ 17 | assert q.shape[-1] == 4 18 | assert v.shape[-1] == 3 19 | assert q.shape[:-1] == v.shape[:-1] 20 | 21 | qvec = q[..., 1:] 22 | uv = torch.cross(qvec, v, dim=len(q.shape)-1) 23 | uuv = torch.cross(qvec, uv, dim=len(q.shape)-1) 24 | return (v + 2 * (q[..., :1] * uv + uuv)) 25 | 26 | 27 | def qinverse(q, inplace=False): 28 | # We assume the quaternion to be normalized 29 | if inplace: 30 | q[..., 1:] *= -1 31 | return q 32 | else: 33 | w = q[..., :1] 34 | xyz = q[..., 1:] 35 | return torch.cat((w, -xyz), dim=len(q.shape)-1) -------------------------------------------------------------------------------- /common/skeleton.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | import numpy as np 9 | 10 | class Skeleton: 11 | def __init__(self, parents, joints_left, joints_right): 12 | assert len(joints_left) == len(joints_right) 13 | 14 | self._parents = np.array(parents) 15 | self._joints_left = joints_left 16 | self._joints_right = joints_right 17 | self._compute_metadata() 18 | 19 | def num_joints(self): 20 | return len(self._parents) 21 | 22 | def parents(self): 23 | return self._parents 24 | 25 | def has_children(self): 26 | return self._has_children 27 | 28 | def children(self): 29 | return self._children 30 | 31 | def remove_joints(self, joints_to_remove): 32 | """ 33 | Remove the joints specified in 'joints_to_remove'. 34 | """ 35 | valid_joints = [] 36 | for joint in range(len(self._parents)): 37 | if joint not in joints_to_remove: 38 | valid_joints.append(joint) 39 | 40 | for i in range(len(self._parents)): 41 | while self._parents[i] in joints_to_remove: 42 | self._parents[i] = self._parents[self._parents[i]] 43 | 44 | index_offsets = np.zeros(len(self._parents), dtype=int) 45 | new_parents = [] 46 | for i, parent in enumerate(self._parents): 47 | if i not in joints_to_remove: 48 | new_parents.append(parent - index_offsets[parent]) 49 | else: 50 | index_offsets[i:] += 1 51 | self._parents = np.array(new_parents) 52 | 53 | 54 | if self._joints_left is not None: 55 | new_joints_left = [] 56 | for joint in self._joints_left: 57 | if joint in valid_joints: 58 | new_joints_left.append(joint - index_offsets[joint]) 59 | self._joints_left = new_joints_left 60 | if self._joints_right is not None: 61 | new_joints_right = [] 62 | for joint in self._joints_right: 63 | if joint in valid_joints: 64 | new_joints_right.append(joint - index_offsets[joint]) 65 | self._joints_right = new_joints_right 66 | 67 | self._compute_metadata() 68 | 69 | return valid_joints 70 | 71 | def joints_left(self): 72 | return self._joints_left 73 | 74 | def joints_right(self): 75 | return self._joints_right 76 | 77 | def _compute_metadata(self): 78 | self._has_children = np.zeros(len(self._parents)).astype(bool) 79 | for i, parent in enumerate(self._parents): 80 | if parent != -1: 81 | self._has_children[parent] = True 82 | 83 | self._children = [] 84 | for i, parent in enumerate(self._parents): 85 | self._children.append([]) 86 | for i, parent in enumerate(self._parents): 87 | if parent != -1: 88 | self._children[parent].append(i) -------------------------------------------------------------------------------- /common/utils.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 
2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | import torch 9 | import numpy as np 10 | import hashlib 11 | 12 | def wrap(func, *args, unsqueeze=False): 13 | """ 14 | Wrap a torch function so it can be called with NumPy arrays. 15 | Input and return types are seamlessly converted. 16 | """ 17 | 18 | # Convert input types where applicable 19 | args = list(args) 20 | for i, arg in enumerate(args): 21 | if type(arg) == np.ndarray: 22 | args[i] = torch.from_numpy(arg) 23 | if unsqueeze: 24 | args[i] = args[i].unsqueeze(0) 25 | 26 | result = func(*args) 27 | 28 | # Convert output types where applicable 29 | if isinstance(result, tuple): 30 | result = list(result) 31 | for i, res in enumerate(result): 32 | if type(res) == torch.Tensor: 33 | if unsqueeze: 34 | res = res.squeeze(0) 35 | result[i] = res.numpy() 36 | return tuple(result) 37 | elif type(result) == torch.Tensor: 38 | if unsqueeze: 39 | result = result.squeeze(0) 40 | return result.numpy() 41 | else: 42 | return result 43 | 44 | def deterministic_random(min_value, max_value, data): 45 | digest = hashlib.sha256(data.encode()).digest() 46 | raw_value = int.from_bytes(digest[:4], byteorder='little', signed=False) 47 | return int(raw_value / (2**32 - 1) * (max_value - min_value)) + min_value -------------------------------------------------------------------------------- /common/visualization.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 
6 | # 7 | 8 | import matplotlib 9 | matplotlib.use('Agg') 10 | 11 | import matplotlib.pyplot as plt 12 | from matplotlib.animation import FuncAnimation, writers 13 | from mpl_toolkits.mplot3d import Axes3D 14 | import numpy as np 15 | import subprocess as sp 16 | 17 | def get_resolution(filename): 18 | command = ['ffprobe', '-v', 'error', '-select_streams', 'v:0', 19 | '-show_entries', 'stream=width,height', '-of', 'csv=p=0', filename] 20 | with sp.Popen(command, stdout=sp.PIPE, bufsize=-1) as pipe: 21 | for line in pipe.stdout: 22 | w, h = line.decode().strip().split(',') 23 | return int(w), int(h) 24 | 25 | def get_fps(filename): 26 | command = ['ffprobe', '-v', 'error', '-select_streams', 'v:0', 27 | '-show_entries', 'stream=r_frame_rate', '-of', 'csv=p=0', filename] 28 | with sp.Popen(command, stdout=sp.PIPE, bufsize=-1) as pipe: 29 | for line in pipe.stdout: 30 | a, b = line.decode().strip().split('/') 31 | return int(a) / int(b) 32 | 33 | def read_video(filename, skip=0, limit=-1): 34 | w, h = get_resolution(filename) 35 | 36 | command = ['ffmpeg', 37 | '-i', filename, 38 | '-f', 'image2pipe', 39 | '-pix_fmt', 'rgb24', 40 | '-vsync', '0', 41 | '-vcodec', 'rawvideo', '-'] 42 | 43 | i = 0 44 | with sp.Popen(command, stdout = sp.PIPE, bufsize=-1) as pipe: 45 | while True: 46 | data = pipe.stdout.read(w*h*3) 47 | if not data: 48 | break 49 | i += 1 50 | if i > limit and limit != -1: 51 | continue 52 | if i > skip: 53 | yield np.frombuffer(data, dtype='uint8').reshape((h, w, 3)) 54 | 55 | 56 | 57 | 58 | def downsample_tensor(X, factor): 59 | length = X.shape[0]//factor * factor 60 | return np.mean(X[:length].reshape(-1, factor, *X.shape[1:]), axis=1) 61 | 62 | def render_animation(keypoints, keypoints_metadata, poses, skeleton, fps, bitrate, azim, output, viewport, 63 | limit=-1, downsample=1, size=6, input_video_path=None, input_video_skip=0): 64 | """ 65 | TODO 66 | Render an animation. The supported output modes are: 67 | -- 'interactive': display an interactive figure 68 | (also works on notebooks if associated with %matplotlib inline) 69 | -- 'html': render the animation as HTML5 video. Can be displayed in a notebook using HTML(...). 70 | -- 'filename.mp4': render and export the animation as an h264 video (requires ffmpeg). 71 | -- 'filename.gif': render and export the animation a gif file (requires imagemagick). 
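    Main arguments (as inferred from the usage below; a descriptive sketch rather than a formal API contract):
     -- keypoints: (num_frames, num_joints, 2) array of 2D keypoints overlaid on the input frames.
     -- keypoints_metadata: dict providing 'layout_name' and 'keypoints_symmetry' for the 2D layout.
     -- poses: dict mapping a subplot title to a (num_frames, num_joints, 3) array of 3D poses.
     -- skeleton: Skeleton instance supplying joint parents and the right-side joint indices.
     -- fps: output frame rate; if None, it is read from the input video via ffprobe.
     -- azim: azimuth angle of the 3D views; viewport: (width, height) used when no input video is given.
     -- downsample: temporal downsampling factor applied to frames, keypoints, and poses.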
72 | """ 73 | plt.ioff() 74 | fig = plt.figure(figsize=(size*(1 + len(poses)), size)) 75 | ax_in = fig.add_subplot(1, 1 + len(poses), 1) 76 | ax_in.get_xaxis().set_visible(False) 77 | ax_in.get_yaxis().set_visible(False) 78 | ax_in.set_axis_off() 79 | ax_in.set_title('Input') 80 | 81 | ax_3d = [] 82 | lines_3d = [] 83 | trajectories = [] 84 | radius = 1.7 85 | for index, (title, data) in enumerate(poses.items()): 86 | ax = fig.add_subplot(1, 1 + len(poses), index+2, projection='3d') 87 | ax.view_init(elev=15., azim=azim) 88 | ax.set_xlim3d([-radius/2, radius/2]) 89 | ax.set_zlim3d([0, radius]) 90 | ax.set_ylim3d([-radius/2, radius/2]) 91 | try: 92 | ax.set_aspect('equal') 93 | except NotImplementedError: 94 | ax.set_aspect('auto') 95 | ax.set_xticklabels([]) 96 | ax.set_yticklabels([]) 97 | ax.set_zticklabels([]) 98 | ax.dist = 7.5 99 | ax.set_title(title) #, pad=35 100 | ax_3d.append(ax) 101 | lines_3d.append([]) 102 | trajectories.append(data[:, 0, [0, 1]]) 103 | poses = list(poses.values()) 104 | 105 | # Decode video 106 | if input_video_path is None: 107 | # Black background 108 | all_frames = np.zeros((keypoints.shape[0], viewport[1], viewport[0]), dtype='uint8') 109 | else: 110 | # Load video using ffmpeg 111 | all_frames = [] 112 | for f in read_video(input_video_path, skip=input_video_skip, limit=limit): 113 | all_frames.append(f) 114 | effective_length = min(keypoints.shape[0], len(all_frames)) 115 | all_frames = all_frames[:effective_length] 116 | 117 | keypoints = keypoints[input_video_skip:] # todo remove 118 | for idx in range(len(poses)): 119 | poses[idx] = poses[idx][input_video_skip:] 120 | 121 | if fps is None: 122 | fps = get_fps(input_video_path) 123 | 124 | if downsample > 1: 125 | keypoints = downsample_tensor(keypoints, downsample) 126 | all_frames = downsample_tensor(np.array(all_frames), downsample).astype('uint8') 127 | for idx in range(len(poses)): 128 | poses[idx] = downsample_tensor(poses[idx], downsample) 129 | trajectories[idx] = downsample_tensor(trajectories[idx], downsample) 130 | fps /= downsample 131 | 132 | initialized = False 133 | image = None 134 | lines = [] 135 | points = None 136 | 137 | if limit < 1: 138 | limit = len(all_frames) 139 | else: 140 | limit = min(limit, len(all_frames)) 141 | 142 | parents = skeleton.parents() 143 | def update_video(i): 144 | nonlocal initialized, image, lines, points 145 | 146 | for n, ax in enumerate(ax_3d): 147 | ax.set_xlim3d([-radius/2 + trajectories[n][i, 0], radius/2 + trajectories[n][i, 0]]) 148 | ax.set_ylim3d([-radius/2 + trajectories[n][i, 1], radius/2 + trajectories[n][i, 1]]) 149 | 150 | # Update 2D poses 151 | joints_right_2d = keypoints_metadata['keypoints_symmetry'][1] 152 | colors_2d = np.full(keypoints.shape[1], 'black') 153 | colors_2d[joints_right_2d] = 'red' 154 | if not initialized: 155 | image = ax_in.imshow(all_frames[i], aspect='equal') 156 | 157 | for j, j_parent in enumerate(parents): 158 | if j_parent == -1: 159 | continue 160 | 161 | if len(parents) == keypoints.shape[1] and keypoints_metadata['layout_name'] != 'coco': 162 | # Draw skeleton only if keypoints match (otherwise we don't have the parents definition) 163 | lines.append(ax_in.plot([keypoints[i, j, 0], keypoints[i, j_parent, 0]], 164 | [keypoints[i, j, 1], keypoints[i, j_parent, 1]], color='pink')) 165 | 166 | col = 'red' if j in skeleton.joints_right() else 'black' 167 | for n, ax in enumerate(ax_3d): 168 | pos = poses[n][i] 169 | lines_3d[n].append(ax.plot([pos[j, 0], pos[j_parent, 0]], 170 | [pos[j, 1], pos[j_parent, 1]], 171 
| [pos[j, 2], pos[j_parent, 2]], zdir='z', c=col)) 172 | 173 | points = ax_in.scatter(*keypoints[i].T, 10, color=colors_2d, edgecolors='white', zorder=10) 174 | 175 | initialized = True 176 | else: 177 | image.set_data(all_frames[i]) 178 | 179 | for j, j_parent in enumerate(parents): 180 | if j_parent == -1: 181 | continue 182 | 183 | if len(parents) == keypoints.shape[1] and keypoints_metadata['layout_name'] != 'coco': 184 | lines[j-1][0].set_data([keypoints[i, j, 0], keypoints[i, j_parent, 0]], 185 | [keypoints[i, j, 1], keypoints[i, j_parent, 1]]) 186 | 187 | for n, ax in enumerate(ax_3d): 188 | pos = poses[n][i] 189 | lines_3d[n][j-1][0].set_xdata(np.array([pos[j, 0], pos[j_parent, 0]])) 190 | lines_3d[n][j-1][0].set_ydata(np.array([pos[j, 1], pos[j_parent, 1]])) 191 | lines_3d[n][j-1][0].set_3d_properties(np.array([pos[j, 2], pos[j_parent, 2]]), zdir='z') 192 | 193 | points.set_offsets(keypoints[i]) 194 | 195 | print('{}/{} '.format(i, limit), end='\r') 196 | 197 | 198 | fig.tight_layout() 199 | 200 | anim = FuncAnimation(fig, update_video, frames=np.arange(0, limit), interval=1000/fps, repeat=False) 201 | if output.endswith('.mp4'): 202 | Writer = writers['ffmpeg'] 203 | writer = Writer(fps=fps, metadata={}, bitrate=bitrate) 204 | anim.save(output, writer=writer) 205 | elif output.endswith('.gif'): 206 | anim.save(output, dpi=80, writer='imagemagick') 207 | else: 208 | raise ValueError('Unsupported output format (only .mp4 and .gif are supported)') 209 | plt.close() -------------------------------------------------------------------------------- /data/ConvertHumanEva.m: -------------------------------------------------------------------------------- 1 | % Copyright (c) 2018-present, Facebook, Inc. 2 | % All rights reserved. 3 | % 4 | % This source code is licensed under the license found in the 5 | % LICENSE file in the root directory of this source tree. 6 | % 7 | 8 | function [] = ConvertDataset() 9 | 10 | N_JOINTS = 15; % Set to 20 if you want to export a 20-joint skeleton 11 | 12 | function [pose_out] = ExtractPose15(pose, dimensions) 13 | % We use the same 15-joint skeleton as in the evaluation 14 | % script "@body_pose/error.m". Proximal and Distal joints 15 | % are averaged. 
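        % Zero-indexed, this ordering corresponds to the 'humaneva15' layout in data/data_utils.py:
        % left-side joints are 2-4 (shoulder, elbow, wrist) and 8-10 (hip, knee, ankle),
        % right-side joints are 5-7 and 11-13.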
16 | pose_out = NaN(15, dimensions); 17 | pose_out(1, :) = pose.torsoDistal; % Pelvis (root) 18 | pose_out(2, :) = (pose.torsoProximal + pose.headProximal) / 2; % Thorax 19 | pose_out(3, :) = pose.upperLArmProximal; % Left shoulder 20 | pose_out(4, :) = (pose.upperLArmDistal + pose.lowerLArmProximal) / 2; % Left elbow 21 | pose_out(5, :) = pose.lowerLArmDistal; % Left wrist 22 | pose_out(6, :) = pose.upperRArmProximal; % Right shoulder 23 | pose_out(7, :) = (pose.upperRArmDistal + pose.lowerRArmProximal) / 2; % Right elbow 24 | pose_out(8, :) = pose.lowerRArmDistal; % Right wrist 25 | pose_out(9, :) = pose.upperLLegProximal; % Left hip 26 | pose_out(10, :) = (pose.upperLLegDistal + pose.lowerLLegProximal) / 2; % Left knee 27 | pose_out(11, :) = pose.lowerLLegDistal; % Left ankle 28 | pose_out(12, :) = pose.upperRLegProximal; % Right hip 29 | pose_out(13, :) = (pose.upperRLegDistal + pose.lowerRLegProximal) / 2; % Right knee 30 | pose_out(14, :) = pose.lowerRLegDistal; % Right ankle 31 | pose_out(15, :) = pose.headDistal; % Head 32 | end 33 | 34 | function [pose_out] = ExtractPose20(pose, dimensions) 35 | pose_out = NaN(20, dimensions); 36 | pose_out(1, :) = pose.torsoDistal; % Pelvis (root) 37 | pose_out(2, :) = pose.torsoProximal; 38 | pose_out(3, :) = pose.headProximal; 39 | pose_out(4, :) = pose.upperLArmProximal; % Left shoulder 40 | pose_out(5, :) = pose.upperLArmDistal; 41 | pose_out(6, :) = pose.lowerLArmProximal; 42 | pose_out(7, :) = pose.lowerLArmDistal; % Left wrist 43 | pose_out(8, :) = pose.upperRArmProximal; % Right shoulder 44 | pose_out(9, :) = pose.upperRArmDistal; 45 | pose_out(10, :) = pose.lowerRArmProximal; 46 | pose_out(11, :) = pose.lowerRArmDistal; % Right wrist 47 | pose_out(12, :) = pose.upperLLegProximal; % Left hip 48 | pose_out(13, :) = pose.upperLLegDistal; 49 | pose_out(14, :) = pose.lowerLLegProximal; 50 | pose_out(15, :) = pose.lowerLLegDistal; % Left ankle 51 | pose_out(16, :) = pose.upperRLegProximal; % Right hip 52 | pose_out(17, :) = pose.upperRLegDistal; 53 | pose_out(18, :) = pose.lowerRLegProximal; 54 | pose_out(19, :) = pose.lowerRLegDistal; % Right ankle 55 | pose_out(20, :) = pose.headDistal; % Head 56 | end 57 | 58 | addpath('./TOOLBOX_calib/'); 59 | addpath('./TOOLBOX_common/'); 60 | addpath('./TOOLBOX_dxAvi/'); 61 | addpath('./TOOLBOX_readc3d/'); 62 | 63 | % Create the output directory for the converted dataset 64 | OUT_DIR = ['./converted_', int2str(N_JOINTS), 'j']; 65 | warning('off', 'MATLAB:MKDIR:DirectoryExists'); 66 | mkdir(OUT_DIR); 67 | 68 | % We use the validation set as the test set 69 | for SPLIT = {'Train', 'Validate'} 70 | mkdir([OUT_DIR, '/', SPLIT{1}]); 71 | CurrentDataset = he_dataset('HumanEvaI', SPLIT{1}); 72 | 73 | for SEQ = 1:length(CurrentDataset) 74 | 75 | Subject = char(get(CurrentDataset(SEQ), 'SubjectName')); 76 | Action = char(get(CurrentDataset(SEQ), 'ActionType')); 77 | Trial = char(get(CurrentDataset(SEQ), 'Trial')); 78 | DatasetBasePath = char(get(CurrentDataset(SEQ), 'DatasetBasePath')); 79 | if Trial ~= '1' 80 | % We are only interested in fully-annotated data 81 | continue; 82 | end 83 | 84 | if strcmp(Action, 'ThrowCatch') && strcmp(Subject, 'S3') 85 | % Damaged mocap stream 86 | continue; 87 | end 88 | 89 | fprintf('Converting...\n') 90 | fprintf('\tSplit: %s\n', SPLIT{1}); 91 | fprintf('\tSubject: %s\n', Subject); 92 | fprintf('\tAction: %s\n', Action); 93 | fprintf('\tTrial: %s\n', Trial); 94 | 95 | % Create subject directory if it does not exist 96 | mkdir([OUT_DIR, '/', SPLIT{1}, '/', Subject]); 97 | 98 | 
% Load the sequence 99 | [~, ~, MocapStream, MocapStream_Enabled] ... 100 | = sync_stream(CurrentDataset(SEQ)); 101 | 102 | % Set frame range 103 | FrameStart = get(CurrentDataset(SEQ), 'FrameStart'); 104 | FrameStart = [FrameStart{:}]; 105 | FrameEnd = get(CurrentDataset(SEQ), 'FrameEnd'); 106 | FrameEnd = [FrameEnd{:}]; 107 | 108 | fprintf('\tNum. frames: %d\n', FrameEnd - FrameStart + 1); 109 | poses_3d = NaN(FrameEnd - FrameStart + 1, N_JOINTS, 3); 110 | poses_2d = NaN(3, FrameEnd - FrameStart + 1, N_JOINTS, 2); 111 | corrupt = 0; 112 | for FRAME = FrameStart:FrameEnd 113 | 114 | if (MocapStream_Enabled) 115 | [MocapStream, pose, ValidPose] = cur_frame(MocapStream, FRAME, 'body_pose'); 116 | 117 | if (ValidPose) 118 | i = FRAME - FrameStart + 1; 119 | 120 | % Extract 3D pose 121 | if N_JOINTS == 15 122 | poses_3d(i, :, :) = ExtractPose15(pose, 3); 123 | else 124 | poses_3d(i, :, :) = ExtractPose20(pose, 3); 125 | end 126 | 127 | % Extract ground-truth 2D pose via camera 128 | % projection 129 | for CAM = 1:3 130 | if (CAM == 1) 131 | CameraName = 'C1'; 132 | elseif (CAM == 2) 133 | CameraName = 'C2'; 134 | elseif (CAM == 3) 135 | CameraName = 'C3'; 136 | end 137 | CalibrationFilename = [DatasetBasePath, Subject, '/Calibration_Data/', CameraName, '.cal']; 138 | pose_2d = project2d(pose, CalibrationFilename); 139 | if N_JOINTS == 15 140 | poses_2d(CAM, i, :, :) = ExtractPose15(pose_2d, 2); 141 | else 142 | poses_2d(CAM, i, :, :) = ExtractPose20(pose_2d, 2); 143 | end 144 | end 145 | 146 | else 147 | corrupt = corrupt + 1; 148 | end 149 | end 150 | end 151 | fprintf('\n%d out of %d frames are damaged\n', corrupt, FrameEnd - FrameStart + 1); 152 | FileName = [OUT_DIR, '/', SPLIT{1}, '/', Subject, '/', Action, '_', Trial, '.mat']; 153 | save(FileName, 'poses_3d', 'poses_2d'); 154 | fprintf('... saved to %s\n\n', FileName); 155 | end 156 | end 157 | end 158 | -------------------------------------------------------------------------------- /data/convert_cdf_to_mat.m: -------------------------------------------------------------------------------- 1 | % Copyright (c) 2018-present, Facebook, Inc. 2 | % All rights reserved. 3 | % 4 | % This source code is licensed under the license found in the 5 | % LICENSE file in the root directory of this source tree. 6 | % 7 | 8 | % Extract "Poses_D3_Positions_S*.tgz" to the "pose" directory 9 | % and run this script to convert all .cdf files to .mat 10 | 11 | pose_directory = 'pose'; 12 | dirs = dir(strcat(pose_directory, '/*/MyPoseFeatures/D3_Positions/*.cdf')); 13 | 14 | paths = {dirs.folder}; 15 | names = {dirs.name}; 16 | 17 | for i = 1:numel(names) 18 | data = cdfread(strcat(paths{i}, '/', names{i})); 19 | save(strcat(paths{i}, '/', names{i}, '.mat'), 'data'); 20 | end -------------------------------------------------------------------------------- /data/data_utils.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 
6 | # 7 | 8 | import numpy as np 9 | 10 | mpii_metadata = { 11 | 'layout_name': 'mpii', 12 | 'num_joints': 16, 13 | 'keypoints_symmetry': [ 14 | [3, 4, 5, 13, 14, 15], 15 | [0, 1, 2, 10, 11, 12], 16 | ] 17 | } 18 | 19 | coco_metadata = { 20 | 'layout_name': 'coco', 21 | 'num_joints': 17, 22 | 'keypoints_symmetry': [ 23 | [1, 3, 5, 7, 9, 11, 13, 15], 24 | [2, 4, 6, 8, 10, 12, 14, 16], 25 | ] 26 | } 27 | 28 | h36m_metadata = { 29 | 'layout_name': 'h36m', 30 | 'num_joints': 17, 31 | 'keypoints_symmetry': [ 32 | [4, 5, 6, 11, 12, 13], 33 | [1, 2, 3, 14, 15, 16], 34 | ] 35 | } 36 | 37 | humaneva15_metadata = { 38 | 'layout_name': 'humaneva15', 39 | 'num_joints': 15, 40 | 'keypoints_symmetry': [ 41 | [2, 3, 4, 8, 9, 10], 42 | [5, 6, 7, 11, 12, 13] 43 | ] 44 | } 45 | 46 | humaneva20_metadata = { 47 | 'layout_name': 'humaneva20', 48 | 'num_joints': 20, 49 | 'keypoints_symmetry': [ 50 | [3, 4, 5, 6, 11, 12, 13, 14], 51 | [7, 8, 9, 10, 15, 16, 17, 18] 52 | ] 53 | } 54 | 55 | def suggest_metadata(name): 56 | names = [] 57 | for metadata in [mpii_metadata, coco_metadata, h36m_metadata, humaneva15_metadata, humaneva20_metadata]: 58 | if metadata['layout_name'] in name: 59 | return metadata 60 | names.append(metadata['layout_name']) 61 | raise KeyError('Cannot infer keypoint layout from name "{}". Tried {}.'.format(name, names)) 62 | 63 | def import_detectron_poses(path): 64 | # Latin1 encoding because Detectron runs on Python 2.7 65 | data = np.load(path, encoding='latin1') 66 | kp = data['keypoints'] 67 | bb = data['boxes'] 68 | results = [] 69 | for i in range(len(bb)): 70 | if len(bb[i][1]) == 0: 71 | assert i > 0 72 | # Use last pose in case of detection failure 73 | results.append(results[-1]) 74 | continue 75 | best_match = np.argmax(bb[i][1][:, 4]) 76 | keypoints = kp[i][1][best_match].T.copy() 77 | results.append(keypoints) 78 | results = np.array(results) 79 | return results[:, :, 4:6] # Soft-argmax 80 | #return results[:, :, [0, 1, 3]] # Argmax + score 81 | 82 | 83 | def import_cpn_poses(path): 84 | data = np.load(path) 85 | kp = data['keypoints'] 86 | return kp[:, :, :2] 87 | 88 | 89 | def import_sh_poses(path): 90 | import h5py 91 | with h5py.File(path) as hf: 92 | positions = hf['poses'].value 93 | return positions.astype('float32') 94 | 95 | def suggest_pose_importer(name): 96 | if 'detectron' in name: 97 | return import_detectron_poses 98 | if 'cpn' in name: 99 | return import_cpn_poses 100 | if 'sh' in name: 101 | return import_sh_poses 102 | raise KeyError('Cannot infer keypoint format from name "{}". Tried detectron, cpn, sh.'.format(name)) 103 | -------------------------------------------------------------------------------- /data/prepare_data_2d_custom.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 
6 | # 7 | 8 | import numpy as np 9 | from glob import glob 10 | import os 11 | import sys 12 | 13 | import argparse 14 | from data_utils import suggest_metadata 15 | 16 | output_prefix_2d = 'data_2d_custom_' 17 | 18 | def decode(filename): 19 | # Latin1 encoding because Detectron runs on Python 2.7 20 | print('Processing {}'.format(filename)) 21 | data = np.load(filename, encoding='latin1', allow_pickle=True) 22 | bb = data['boxes'] 23 | kp = data['keypoints'] 24 | metadata = data['metadata'].item() 25 | results_bb = [] 26 | results_kp = [] 27 | for i in range(len(bb)): 28 | if len(bb[i][1]) == 0 or len(kp[i][1]) == 0: 29 | # No bbox/keypoints detected for this frame -> will be interpolated 30 | results_bb.append(np.full(4, np.nan, dtype=np.float32)) # 4 bounding box coordinates 31 | results_kp.append(np.full((17, 4), np.nan, dtype=np.float32)) # 17 COCO keypoints 32 | continue 33 | best_match = np.argmax(bb[i][1][:, 4]) 34 | best_bb = bb[i][1][best_match, :4] 35 | best_kp = kp[i][1][best_match].T.copy() 36 | results_bb.append(best_bb) 37 | results_kp.append(best_kp) 38 | 39 | bb = np.array(results_bb, dtype=np.float32) 40 | kp = np.array(results_kp, dtype=np.float32) 41 | kp = kp[:, :, :2] # Extract (x, y) 42 | 43 | # Fix missing bboxes/keypoints by linear interpolation 44 | mask = ~np.isnan(bb[:, 0]) 45 | indices = np.arange(len(bb)) 46 | for i in range(4): 47 | bb[:, i] = np.interp(indices, indices[mask], bb[mask, i]) 48 | for i in range(17): 49 | for j in range(2): 50 | kp[:, i, j] = np.interp(indices, indices[mask], kp[mask, i, j]) 51 | 52 | print('{} total frames processed'.format(len(bb))) 53 | print('{} frames were interpolated'.format(np.sum(~mask))) 54 | print('----------') 55 | 56 | return [{ 57 | 'start_frame': 0, # Inclusive 58 | 'end_frame': len(kp), # Exclusive 59 | 'bounding_boxes': bb, 60 | 'keypoints': kp, 61 | }], metadata 62 | 63 | 64 | if __name__ == '__main__': 65 | if os.path.basename(os.getcwd()) != 'data': 66 | print('This script must be launched from the "data" directory') 67 | exit(0) 68 | 69 | parser = argparse.ArgumentParser(description='Custom dataset creator') 70 | parser.add_argument('-i', '--input', type=str, default='', metavar='PATH', help='detections directory') 71 | parser.add_argument('-o', '--output', type=str, default='', metavar='PATH', help='output suffix for 2D detections') 72 | args = parser.parse_args() 73 | 74 | if not args.input: 75 | print('Please specify the input directory') 76 | exit(0) 77 | 78 | if not args.output: 79 | print('Please specify an output suffix (e.g. detectron_pt_coco)') 80 | exit(0) 81 | 82 | print('Parsing 2D detections from', args.input) 83 | 84 | metadata = suggest_metadata('coco') 85 | metadata['video_metadata'] = {} 86 | 87 | output = {} 88 | file_list = glob(args.input + '/*.npz') 89 | for f in file_list: 90 | canonical_name = os.path.splitext(os.path.basename(f))[0] 91 | data, video_metadata = decode(f) 92 | output[canonical_name] = {} 93 | output[canonical_name]['custom'] = [data[0]['keypoints'].astype('float32')] 94 | metadata['video_metadata'][canonical_name] = video_metadata 95 | 96 | print('Saving...') 97 | np.savez_compressed(output_prefix_2d + args.output, positions_2d=output, metadata=metadata) 98 | print('Done.') -------------------------------------------------------------------------------- /data/prepare_data_2d_h36m_generic.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 
3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | import argparse 9 | import os 10 | import zipfile 11 | import numpy as np 12 | import h5py 13 | import re 14 | from glob import glob 15 | from shutil import rmtree 16 | from data_utils import suggest_metadata, suggest_pose_importer 17 | 18 | import sys 19 | sys.path.append('../') 20 | from common.utils import wrap 21 | from itertools import groupby 22 | 23 | output_prefix_2d = 'data_2d_h36m_' 24 | cam_map = { 25 | '54138969': 0, 26 | '55011271': 1, 27 | '58860488': 2, 28 | '60457274': 3, 29 | } 30 | 31 | if __name__ == '__main__': 32 | if os.path.basename(os.getcwd()) != 'data': 33 | print('This script must be launched from the "data" directory') 34 | exit(0) 35 | 36 | parser = argparse.ArgumentParser(description='Human3.6M dataset converter') 37 | 38 | parser.add_argument('-i', '--input', default='', type=str, metavar='PATH', help='input path to 2D detections') 39 | parser.add_argument('-o', '--output', default='', type=str, metavar='PATH', help='output suffix for 2D detections (e.g. detectron_pt_coco)') 40 | 41 | args = parser.parse_args() 42 | 43 | if not args.input: 44 | print('Please specify the input directory') 45 | exit(0) 46 | 47 | if not args.output: 48 | print('Please specify an output suffix (e.g. detectron_pt_coco)') 49 | exit(0) 50 | 51 | import_func = suggest_pose_importer(args.output) 52 | metadata = suggest_metadata(args.output) 53 | 54 | print('Parsing 2D detections from', args.input) 55 | 56 | output = {} 57 | file_list = glob(args.input + '/S*/*.mp4.npz') 58 | for f in file_list: 59 | path, fname = os.path.split(f) 60 | subject = os.path.basename(path) 61 | assert subject.startswith('S'), subject + ' does not look like a subject directory' 62 | 63 | if '_ALL' in fname: 64 | continue 65 | 66 | m = re.search('(.*)\\.([0-9]+)\\.mp4\\.npz', fname) 67 | action = m.group(1) 68 | camera = m.group(2) 69 | camera_idx = cam_map[camera] 70 | 71 | if subject == 'S11' and action == 'Directions': 72 | continue # Discard corrupted video 73 | 74 | # Use consistent naming convention 75 | canonical_name = action.replace('TakingPhoto', 'Photo') \ 76 | .replace('WalkingDog', 'WalkDog') 77 | 78 | keypoints = import_func(f) 79 | assert keypoints.shape[1] == metadata['num_joints'] 80 | 81 | if subject not in output: 82 | output[subject] = {} 83 | if canonical_name not in output[subject]: 84 | output[subject][canonical_name] = [None, None, None, None] 85 | output[subject][canonical_name][camera_idx] = keypoints.astype('float32') 86 | 87 | print('Saving...') 88 | np.savez_compressed(output_prefix_2d + args.output, positions_2d=output, metadata=metadata) 89 | print('Done.') -------------------------------------------------------------------------------- /data/prepare_data_2d_h36m_sh.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 
6 | # 7 | 8 | import argparse 9 | import os 10 | import zipfile 11 | import tarfile 12 | import numpy as np 13 | import h5py 14 | from glob import glob 15 | from shutil import rmtree 16 | 17 | import sys 18 | sys.path.append('../') 19 | from common.h36m_dataset import Human36mDataset 20 | from common.camera import world_to_camera, project_to_2d, image_coordinates 21 | from common.utils import wrap 22 | 23 | output_filename_pt = 'data_2d_h36m_sh_pt_mpii' 24 | output_filename_ft = 'data_2d_h36m_sh_ft_h36m' 25 | subjects = ['S1', 'S5', 'S6', 'S7', 'S8', 'S9', 'S11'] 26 | cam_map = { 27 | '54138969': 0, 28 | '55011271': 1, 29 | '58860488': 2, 30 | '60457274': 3, 31 | } 32 | 33 | metadata = { 34 | 'num_joints': 16, 35 | 'keypoints_symmetry': [ 36 | [3, 4, 5, 13, 14, 15], 37 | [0, 1, 2, 10, 11, 12], 38 | ] 39 | } 40 | 41 | def process_subject(subject, file_list, output): 42 | if subject == 'S11': 43 | assert len(file_list) == 119, "Expected 119 files for subject " + subject + ", got " + str(len(file_list)) 44 | else: 45 | assert len(file_list) == 120, "Expected 120 files for subject " + subject + ", got " + str(len(file_list)) 46 | 47 | for f in file_list: 48 | action, cam = os.path.splitext(os.path.basename(f))[0].replace('_', ' ').split('.') 49 | 50 | if subject == 'S11' and action == 'Directions': 51 | continue # Discard corrupted video 52 | 53 | if action not in output[subject]: 54 | output[subject][action] = [None, None, None, None] 55 | 56 | with h5py.File(f) as hf: 57 | positions = hf['poses'].value 58 | output[subject][action][cam_map[cam]] = positions.astype('float32') 59 | 60 | if __name__ == '__main__': 61 | if os.path.basename(os.getcwd()) != 'data': 62 | print('This script must be launched from the "data" directory') 63 | exit(0) 64 | 65 | parser = argparse.ArgumentParser(description='Human3.6M dataset downloader/converter') 66 | 67 | parser.add_argument('-pt', '--pretrained', default='', type=str, metavar='PATH', help='convert pretrained dataset') 68 | parser.add_argument('-ft', '--fine-tuned', default='', type=str, metavar='PATH', help='convert fine-tuned dataset') 69 | 70 | args = parser.parse_args() 71 | 72 | if args.pretrained: 73 | print('Converting pretrained dataset from', args.pretrained) 74 | print('Extracting...') 75 | with zipfile.ZipFile(args.pretrained, 'r') as archive: 76 | archive.extractall('sh_pt') 77 | 78 | print('Converting...') 79 | output = {} 80 | for subject in subjects: 81 | output[subject] = {} 82 | file_list = glob('sh_pt/h36m/' + subject + '/StackedHourglass/*.h5') 83 | process_subject(subject, file_list, output) 84 | 85 | print('Saving...') 86 | np.savez_compressed(output_filename_pt, positions_2d=output, metadata=metadata) 87 | 88 | print('Cleaning up...') 89 | rmtree('sh_pt') 90 | 91 | print('Done.') 92 | 93 | if args.fine_tuned: 94 | print('Converting fine-tuned dataset from', args.fine_tuned) 95 | print('Extracting...') 96 | with tarfile.open(args.fine_tuned, 'r:gz') as archive: 97 | archive.extractall('sh_ft') 98 | 99 | print('Converting...') 100 | output = {} 101 | for subject in subjects: 102 | output[subject] = {} 103 | file_list = glob('sh_ft/' + subject + '/StackedHourglassFineTuned240/*.h5') 104 | process_subject(subject, file_list, output) 105 | 106 | print('Saving...') 107 | np.savez_compressed(output_filename_ft, positions_2d=output, metadata=metadata) 108 | 109 | print('Cleaning up...') 110 | rmtree('sh_ft') 111 | 112 | print('Done.') 113 | -------------------------------------------------------------------------------- 
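The 2D preparation scripts above all write the same archive layout: a compressed NumPy file whose 'positions_2d' entry is a nested dictionary (subject -> action -> list of per-camera arrays) and whose 'metadata' entry describes the keypoint layout. A minimal inspection sketch, assuming the fine-tuned Stacked Hourglass archive generated above (any of the generated archives can be substituted):

import numpy as np

# Load one of the generated 2D keypoint archives (the pickled dicts require allow_pickle=True)
archive = np.load('data_2d_h36m_sh_ft_h36m.npz', allow_pickle=True)

metadata = archive['metadata'].item()          # e.g. {'num_joints': 16, 'keypoints_symmetry': [...]}
positions_2d = archive['positions_2d'].item()  # {subject: {action: [cam_0, cam_1, cam_2, cam_3]}}

for subject, actions in positions_2d.items():
    for action, cameras in actions.items():
        for cam_idx, kps in enumerate(cameras):
            if kps is not None:
                print(subject, action, cam_idx, kps.shape)  # (num_frames, num_joints, 2)

This mirrors how run.py later reads these archives (np.load with allow_pickle=True, followed by .item() on both entries).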
/data/prepare_data_h36m.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | import argparse 9 | import os 10 | import zipfile 11 | import numpy as np 12 | import h5py 13 | from glob import glob 14 | from shutil import rmtree 15 | 16 | import sys 17 | sys.path.append('../') 18 | from common.h36m_dataset import Human36mDataset 19 | from common.camera import world_to_camera, project_to_2d, image_coordinates 20 | from common.utils import wrap 21 | 22 | output_filename = 'data_3d_h36m' 23 | output_filename_2d = 'data_2d_h36m_gt' 24 | subjects = ['S1', 'S5', 'S6', 'S7', 'S8', 'S9', 'S11'] 25 | 26 | if __name__ == '__main__': 27 | if os.path.basename(os.getcwd()) != 'data': 28 | print('This script must be launched from the "data" directory') 29 | exit(0) 30 | 31 | parser = argparse.ArgumentParser(description='Human3.6M dataset downloader/converter') 32 | 33 | # Convert dataset preprocessed by Martinez et al. in https://github.com/una-dinosauria/3d-pose-baseline 34 | parser.add_argument('--from-archive', default='', type=str, metavar='PATH', help='convert preprocessed dataset') 35 | 36 | # Convert dataset from original source, using files converted to .mat (the Human3.6M dataset path must be specified manually) 37 | # This option requires MATLAB to convert files using the provided script 38 | parser.add_argument('--from-source', default='', type=str, metavar='PATH', help='convert original dataset') 39 | 40 | # Convert dataset from original source, using original .cdf files (the Human3.6M dataset path must be specified manually) 41 | # This option does not require MATLAB, but the Python library cdflib must be installed 42 | parser.add_argument('--from-source-cdf', default='', type=str, metavar='PATH', help='convert original dataset') 43 | 44 | args = parser.parse_args() 45 | 46 | if args.from_archive and args.from_source: 47 | print('Please specify only one argument') 48 | exit(0) 49 | 50 | if os.path.exists(output_filename + '.npz'): 51 | print('The dataset already exists at', output_filename + '.npz') 52 | exit(0) 53 | 54 | if args.from_archive: 55 | print('Extracting Human3.6M dataset from', args.from_archive) 56 | with zipfile.ZipFile(args.from_archive, 'r') as archive: 57 | archive.extractall() 58 | 59 | print('Converting...') 60 | output = {} 61 | for subject in subjects: 62 | output[subject] = {} 63 | file_list = glob('h36m/' + subject + '/MyPoses/3D_positions/*.h5') 64 | assert len(file_list) == 30, "Expected 30 files for subject " + subject + ", got " + str(len(file_list)) 65 | for f in file_list: 66 | action = os.path.splitext(os.path.basename(f))[0] 67 | 68 | if subject == 'S11' and action == 'Directions': 69 | continue # Discard corrupted video 70 | 71 | with h5py.File(f) as hf: 72 | positions = hf['3D_positions'].value.reshape(32, 3, -1).transpose(2, 0, 1) 73 | positions /= 1000 # Meters instead of millimeters 74 | output[subject][action] = positions.astype('float32') 75 | 76 | print('Saving...') 77 | np.savez_compressed(output_filename, positions_3d=output) 78 | 79 | print('Cleaning up...') 80 | rmtree('h36m') 81 | 82 | print('Done.') 83 | 84 | elif args.from_source: 85 | print('Converting original Human3.6M dataset from', args.from_source) 86 | output = {} 87 | 88 | from scipy.io import loadmat 89 | 90 | for subject in subjects: 91 
| output[subject] = {} 92 | file_list = glob(args.from_source + '/' + subject + '/MyPoseFeatures/D3_Positions/*.cdf.mat') 93 | assert len(file_list) == 30, "Expected 30 files for subject " + subject + ", got " + str(len(file_list)) 94 | for f in file_list: 95 | action = os.path.splitext(os.path.splitext(os.path.basename(f))[0])[0] 96 | 97 | if subject == 'S11' and action == 'Directions': 98 | continue # Discard corrupted video 99 | 100 | # Use consistent naming convention 101 | canonical_name = action.replace('TakingPhoto', 'Photo') \ 102 | .replace('WalkingDog', 'WalkDog') 103 | 104 | hf = loadmat(f) 105 | positions = hf['data'][0, 0].reshape(-1, 32, 3) 106 | positions /= 1000 # Meters instead of millimeters 107 | output[subject][canonical_name] = positions.astype('float32') 108 | 109 | print('Saving...') 110 | np.savez_compressed(output_filename, positions_3d=output) 111 | 112 | print('Done.') 113 | 114 | elif args.from_source_cdf: 115 | print('Converting original Human3.6M dataset from', args.from_source_cdf, '(CDF files)') 116 | output = {} 117 | 118 | import cdflib 119 | 120 | for subject in subjects: 121 | output[subject] = {} 122 | file_list = glob(args.from_source_cdf + '/' + subject + '/MyPoseFeatures/D3_Positions/*.cdf') 123 | assert len(file_list) == 30, "Expected 30 files for subject " + subject + ", got " + str(len(file_list)) 124 | for f in file_list: 125 | action = os.path.splitext(os.path.basename(f))[0] 126 | 127 | if subject == 'S11' and action == 'Directions': 128 | continue # Discard corrupted video 129 | 130 | # Use consistent naming convention 131 | canonical_name = action.replace('TakingPhoto', 'Photo') \ 132 | .replace('WalkingDog', 'WalkDog') 133 | 134 | hf = cdflib.CDF(f) 135 | positions = hf['Pose'].reshape(-1, 32, 3) 136 | positions /= 1000 # Meters instead of millimeters 137 | output[subject][canonical_name] = positions.astype('float32') 138 | 139 | print('Saving...') 140 | np.savez_compressed(output_filename, positions_3d=output) 141 | 142 | print('Done.') 143 | 144 | else: 145 | print('Please specify the dataset source') 146 | exit(0) 147 | 148 | # Create 2D pose file 149 | print('') 150 | print('Computing ground-truth 2D poses...') 151 | dataset = Human36mDataset(output_filename + '.npz') 152 | output_2d_poses = {} 153 | for subject in dataset.subjects(): 154 | output_2d_poses[subject] = {} 155 | for action in dataset[subject].keys(): 156 | anim = dataset[subject][action] 157 | 158 | positions_2d = [] 159 | for cam in anim['cameras']: 160 | pos_3d = world_to_camera(anim['positions'], R=cam['orientation'], t=cam['translation']) 161 | pos_2d = wrap(project_to_2d, pos_3d, cam['intrinsic'], unsqueeze=True) 162 | pos_2d_pixel_space = image_coordinates(pos_2d, w=cam['res_w'], h=cam['res_h']) 163 | positions_2d.append(pos_2d_pixel_space.astype('float32')) 164 | output_2d_poses[subject][action] = positions_2d 165 | 166 | print('Saving...') 167 | metadata = { 168 | 'num_joints': dataset.skeleton().num_joints(), 169 | 'keypoints_symmetry': [dataset.skeleton().joints_left(), dataset.skeleton().joints_right()] 170 | } 171 | np.savez_compressed(output_filename_2d, positions_2d=output_2d_poses, metadata=metadata) 172 | 173 | print('Done.') 174 | -------------------------------------------------------------------------------- /data/prepare_data_humaneva.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 
3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | import argparse 9 | import os 10 | import zipfile 11 | import numpy as np 12 | import h5py 13 | import re 14 | from glob import glob 15 | from shutil import rmtree 16 | from data_utils import suggest_metadata, suggest_pose_importer 17 | 18 | import sys 19 | sys.path.append('../') 20 | from common.utils import wrap 21 | from itertools import groupby 22 | 23 | subjects = ['Train/S1', 'Train/S2', 'Train/S3', 'Validate/S1', 'Validate/S2', 'Validate/S3'] 24 | 25 | cam_map = { 26 | 'C1': 0, 27 | 'C2': 1, 28 | 'C3': 2, 29 | } 30 | 31 | # Frame numbers for train/test split 32 | # format: [start_frame, end_frame[ (inclusive, exclusive) 33 | index = { 34 | 'Train/S1': { 35 | 'Walking 1': (590, 1203), 36 | 'Jog 1': (367, 740), 37 | 'ThrowCatch 1': (473, 945), 38 | 'Gestures 1': (395, 801), 39 | 'Box 1': (385, 789), 40 | }, 41 | 'Train/S2': { 42 | 'Walking 1': (438, 876), 43 | 'Jog 1': (398, 795), 44 | 'ThrowCatch 1': (550, 1128), 45 | 'Gestures 1': (500, 901), 46 | 'Box 1': (382, 734), 47 | }, 48 | 'Train/S3': { 49 | 'Walking 1': (448, 939), 50 | 'Jog 1': (401, 842), 51 | 'ThrowCatch 1': (493, 1027), 52 | 'Gestures 1': (533, 1102), 53 | 'Box 1': (512, 1021), 54 | }, 55 | 'Validate/S1': { 56 | 'Walking 1': (5, 590), 57 | 'Jog 1': (5, 367), 58 | 'ThrowCatch 1': (5, 473), 59 | 'Gestures 1': (5, 395), 60 | 'Box 1': (5, 385), 61 | }, 62 | 'Validate/S2': { 63 | 'Walking 1': (5, 438), 64 | 'Jog 1': (5, 398), 65 | 'ThrowCatch 1': (5, 550), 66 | 'Gestures 1': (5, 500), 67 | 'Box 1': (5, 382), 68 | }, 69 | 'Validate/S3': { 70 | 'Walking 1': (5, 448), 71 | 'Jog 1': (5, 401), 72 | 'ThrowCatch 1': (5, 493), 73 | 'Gestures 1': (5, 533), 74 | 'Box 1': (5, 512), 75 | }, 76 | } 77 | 78 | # Frames to skip for each video (synchronization) 79 | sync_data = { 80 | 'S1': { 81 | 'Walking 1': (82, 81, 82), 82 | 'Jog 1': (51, 51, 50), 83 | 'ThrowCatch 1': (61, 61, 60), 84 | 'Gestures 1': (45, 45, 44), 85 | 'Box 1': (57, 57, 56), 86 | }, 87 | 'S2': { 88 | 'Walking 1': (115, 115, 114), 89 | 'Jog 1': (100, 100, 99), 90 | 'ThrowCatch 1': (127, 127, 127), 91 | 'Gestures 1': (122, 122, 121), 92 | 'Box 1': (119, 119, 117), 93 | }, 94 | 'S3': { 95 | 'Walking 1': (80, 80, 80), 96 | 'Jog 1': (65, 65, 65), 97 | 'ThrowCatch 1': (79, 79, 79), 98 | 'Gestures 1': (83, 83, 82), 99 | 'Box 1': (1, 1, 1), 100 | }, 101 | 'S4': {} 102 | } 103 | 104 | if __name__ == '__main__': 105 | if os.path.basename(os.getcwd()) != 'data': 106 | print('This script must be launched from the "data" directory') 107 | exit(0) 108 | 109 | parser = argparse.ArgumentParser(description='HumanEva dataset converter') 110 | 111 | parser.add_argument('-p', '--path', default='', type=str, metavar='PATH', help='path to the processed HumanEva dataset') 112 | parser.add_argument('--convert-3d', action='store_true', help='convert 3D mocap data') 113 | parser.add_argument('--convert-2d', default='', type=str, metavar='PATH', help='convert user-supplied 2D detections') 114 | parser.add_argument('-o', '--output', default='', type=str, metavar='PATH', help='output suffix for 2D detections (e.g. 
detectron_pt_coco)') 115 | 116 | args = parser.parse_args() 117 | 118 | if not args.convert_2d and not args.convert_3d: 119 | print('Please specify one conversion mode') 120 | exit(0) 121 | 122 | 123 | if args.path: 124 | print('Parsing HumanEva dataset from', args.path) 125 | output = {} 126 | output_2d = {} 127 | frame_mapping = {} 128 | 129 | from scipy.io import loadmat 130 | 131 | num_joints = None 132 | 133 | for subject in subjects: 134 | output[subject] = {} 135 | output_2d[subject] = {} 136 | split, subject_name = subject.split('/') 137 | if subject_name not in frame_mapping: 138 | frame_mapping[subject_name] = {} 139 | 140 | file_list = glob(args.path + '/' + subject + '/*.mat') 141 | for f in file_list: 142 | action = os.path.splitext(os.path.basename(f))[0] 143 | 144 | # Use consistent naming convention 145 | canonical_name = action.replace('_', ' ') 146 | 147 | hf = loadmat(f) 148 | positions = hf['poses_3d'] 149 | positions_2d = hf['poses_2d'].transpose(1, 0, 2, 3) # Ground-truth 2D poses 150 | assert positions.shape[0] == positions_2d.shape[0] and positions.shape[1] == positions_2d.shape[2] 151 | assert num_joints is None or num_joints == positions.shape[1], "Joint number inconsistency among files" 152 | num_joints = positions.shape[1] 153 | 154 | # Sanity check for the sequence length 155 | assert positions.shape[0] == index[subject][canonical_name][1] - index[subject][canonical_name][0] 156 | 157 | # Split corrupted motion capture streams into contiguous chunks 158 | # e.g. 012XX567X9 is split into "012", "567", and "9". 159 | all_chunks = [list(v) for k, v in groupby(positions, lambda x: np.isfinite(x).all())] 160 | all_chunks_2d = [list(v) for k, v in groupby(positions_2d, lambda x: np.isfinite(x).all())] 161 | assert len(all_chunks) == len(all_chunks_2d) 162 | current_index = index[subject][canonical_name][0] 163 | chunk_indices = [] 164 | for i, chunk in enumerate(all_chunks): 165 | next_index = current_index + len(chunk) 166 | name = canonical_name + ' chunk' + str(i) 167 | if np.isfinite(chunk).all(): 168 | output[subject][name] = np.array(chunk, dtype='float32') / 1000 169 | output_2d[subject][name] = list(np.array(all_chunks_2d[i], dtype='float32').transpose(1, 0, 2, 3)) 170 | chunk_indices.append((current_index, next_index, np.isfinite(chunk).all(), split, name)) 171 | current_index = next_index 172 | assert current_index == index[subject][canonical_name][1] 173 | if canonical_name not in frame_mapping[subject_name]: 174 | frame_mapping[subject_name][canonical_name] = [] 175 | frame_mapping[subject_name][canonical_name] += chunk_indices 176 | 177 | metadata = suggest_metadata('humaneva' + str(num_joints)) 178 | output_filename = 'data_3d_' + metadata['layout_name'] 179 | output_prefix_2d = 'data_2d_' + metadata['layout_name'] + '_' 180 | 181 | if args.convert_3d: 182 | print('Saving...') 183 | np.savez_compressed(output_filename, positions_3d=output) 184 | np.savez_compressed(output_prefix_2d + 'gt', positions_2d=output_2d, metadata=metadata) 185 | print('Done.') 186 | 187 | else: 188 | print('Please specify the dataset source') 189 | exit(0) 190 | 191 | if args.convert_2d: 192 | if not args.output: 193 | print('Please specify an output suffix (e.g. 
detectron_pt_coco)') 194 | exit(0) 195 | 196 | import_func = suggest_pose_importer(args.output) 197 | metadata = suggest_metadata(args.output) 198 | 199 | print('Parsing 2D detections from', args.convert_2d) 200 | 201 | output = {} 202 | file_list = glob(args.convert_2d + '/S*/*.avi.npz') 203 | for f in file_list: 204 | path, fname = os.path.split(f) 205 | subject = os.path.basename(path) 206 | assert subject.startswith('S'), subject + ' does not look like a subject directory' 207 | 208 | m = re.search('(.*) \\((.*)\\)', fname.replace('_', ' ')) 209 | action = m.group(1) 210 | camera = m.group(2) 211 | camera_idx = cam_map[camera] 212 | 213 | keypoints = import_func(f) 214 | assert keypoints.shape[1] == metadata['num_joints'] 215 | 216 | if action in sync_data[subject]: 217 | sync_offset = sync_data[subject][action][camera_idx] - 1 218 | else: 219 | sync_offset = 0 220 | 221 | if subject in frame_mapping and action in frame_mapping[subject]: 222 | chunks = frame_mapping[subject][action] 223 | for (start_idx, end_idx, labeled, split, name) in chunks: 224 | canonical_subject = split + '/' + subject 225 | if not labeled: 226 | canonical_subject = 'Unlabeled/' + canonical_subject 227 | if canonical_subject not in output: 228 | output[canonical_subject] = {} 229 | kps = keypoints[start_idx+sync_offset:end_idx+sync_offset] 230 | assert len(kps) == end_idx - start_idx, "Got len {}, expected {}".format(len(kps), end_idx - start_idx) 231 | 232 | if name not in output[canonical_subject]: 233 | output[canonical_subject][name] = [None, None, None] 234 | 235 | output[canonical_subject][name][camera_idx] = kps.astype('float32') 236 | else: 237 | canonical_subject = 'Unlabeled/' + subject 238 | if canonical_subject not in output: 239 | output[canonical_subject] = {} 240 | if action not in output[canonical_subject]: 241 | output[canonical_subject][action] = [None, None, None] 242 | output[canonical_subject][action][camera_idx] = keypoints.astype('float32') 243 | 244 | print('Saving...') 245 | np.savez_compressed(output_prefix_2d + args.output, positions_2d=output, metadata=metadata) 246 | print('Done.') -------------------------------------------------------------------------------- /images/batching.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookresearch/VideoPose3D/1afb1ca0f1237776518469876342fc8669d3f6a9/images/batching.png -------------------------------------------------------------------------------- /images/convolutions_1f_naive.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookresearch/VideoPose3D/1afb1ca0f1237776518469876342fc8669d3f6a9/images/convolutions_1f_naive.png -------------------------------------------------------------------------------- /images/convolutions_1f_optimized.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookresearch/VideoPose3D/1afb1ca0f1237776518469876342fc8669d3f6a9/images/convolutions_1f_optimized.png -------------------------------------------------------------------------------- /images/convolutions_anim.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookresearch/VideoPose3D/1afb1ca0f1237776518469876342fc8669d3f6a9/images/convolutions_anim.gif -------------------------------------------------------------------------------- /images/convolutions_causal.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookresearch/VideoPose3D/1afb1ca0f1237776518469876342fc8669d3f6a9/images/convolutions_causal.png -------------------------------------------------------------------------------- /images/convolutions_normal.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookresearch/VideoPose3D/1afb1ca0f1237776518469876342fc8669d3f6a9/images/convolutions_normal.png -------------------------------------------------------------------------------- /images/demo_h36m.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookresearch/VideoPose3D/1afb1ca0f1237776518469876342fc8669d3f6a9/images/demo_h36m.gif -------------------------------------------------------------------------------- /images/demo_humaneva.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookresearch/VideoPose3D/1afb1ca0f1237776518469876342fc8669d3f6a9/images/demo_humaneva.gif -------------------------------------------------------------------------------- /images/demo_humaneva_unlabeled.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookresearch/VideoPose3D/1afb1ca0f1237776518469876342fc8669d3f6a9/images/demo_humaneva_unlabeled.gif -------------------------------------------------------------------------------- /images/demo_temporal.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookresearch/VideoPose3D/1afb1ca0f1237776518469876342fc8669d3f6a9/images/demo_temporal.gif -------------------------------------------------------------------------------- /images/demo_yt.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookresearch/VideoPose3D/1afb1ca0f1237776518469876342fc8669d3f6a9/images/demo_yt.gif -------------------------------------------------------------------------------- /inference/infer_video.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | """Perform inference on a single video or all videos with a certain extension 9 | (e.g., .mp4) in a folder. 
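This is the original Detectron (Caffe2) variant; infer_video_d2.py below is the Detectron2
port and produces the same per-video .npz output (boxes, segments, keypoints, and a
metadata dict with the video resolution).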
10 | """ 11 | 12 | from infer_simple import * 13 | import subprocess as sp 14 | import numpy as np 15 | 16 | def get_resolution(filename): 17 | command = ['ffprobe', '-v', 'error', '-select_streams', 'v:0', 18 | '-show_entries', 'stream=width,height', '-of', 'csv=p=0', filename] 19 | pipe = sp.Popen(command, stdout=sp.PIPE, bufsize=-1) 20 | for line in pipe.stdout: 21 | w, h = line.decode().strip().split(',') 22 | return int(w), int(h) 23 | 24 | def read_video(filename): 25 | w, h = get_resolution(filename) 26 | 27 | command = ['ffmpeg', 28 | '-i', filename, 29 | '-f', 'image2pipe', 30 | '-pix_fmt', 'bgr24', 31 | '-vsync', '0', 32 | '-vcodec', 'rawvideo', '-'] 33 | 34 | pipe = sp.Popen(command, stdout=sp.PIPE, bufsize=-1) 35 | while True: 36 | data = pipe.stdout.read(w*h*3) 37 | if not data: 38 | break 39 | yield np.frombuffer(data, dtype='uint8').reshape((h, w, 3)) 40 | 41 | 42 | def main(args): 43 | 44 | logger = logging.getLogger(__name__) 45 | merge_cfg_from_file(args.cfg) 46 | cfg.NUM_GPUS = 1 47 | args.weights = cache_url(args.weights, cfg.DOWNLOAD_CACHE) 48 | assert_and_infer_cfg(cache_urls=False) 49 | model = infer_engine.initialize_model_from_cfg(args.weights) 50 | dummy_coco_dataset = dummy_datasets.get_coco_dataset() 51 | 52 | 53 | 54 | if os.path.isdir(args.im_or_folder): 55 | im_list = glob.iglob(args.im_or_folder + '/*.' + args.image_ext) 56 | else: 57 | im_list = [args.im_or_folder] 58 | 59 | for video_name in im_list: 60 | out_name = os.path.join( 61 | args.output_dir, os.path.basename(video_name) 62 | ) 63 | print('Processing {}'.format(video_name)) 64 | 65 | boxes = [] 66 | segments = [] 67 | keypoints = [] 68 | 69 | for frame_i, im in enumerate(read_video(video_name)): 70 | 71 | logger.info('Frame {}'.format(frame_i)) 72 | timers = defaultdict(Timer) 73 | t = time.time() 74 | with c2_utils.NamedCudaScope(0): 75 | cls_boxes, cls_segms, cls_keyps = infer_engine.im_detect_all( 76 | model, im, None, timers=timers 77 | ) 78 | logger.info('Inference time: {:.3f}s'.format(time.time() - t)) 79 | for k, v in timers.items(): 80 | logger.info(' | {}: {:.3f}s'.format(k, v.average_time)) 81 | 82 | boxes.append(cls_boxes) 83 | segments.append(cls_segms) 84 | keypoints.append(cls_keyps) 85 | 86 | 87 | # Video resolution 88 | metadata = { 89 | 'w': im.shape[1], 90 | 'h': im.shape[0], 91 | } 92 | 93 | np.savez_compressed(out_name, boxes=boxes, segments=segments, keypoints=keypoints, metadata=metadata) 94 | 95 | 96 | if __name__ == '__main__': 97 | workspace.GlobalInit(['caffe2', '--caffe2_log_level=0']) 98 | setup_logging(__name__) 99 | args = parse_args() 100 | main(args) 101 | -------------------------------------------------------------------------------- /inference/infer_video_d2.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | """Perform inference on a single video or all videos with a certain extension 9 | (e.g., .mp4) in a folder. 
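Hypothetical invocation (the config name is an assumption -- any COCO keypoint R-CNN
config from the detectron2 model zoo should work, since it is resolved through
model_zoo.get_config_file / get_checkpoint_url below):

    python infer_video_d2.py \
        --cfg COCO-Keypoints/keypoint_rcnn_R_101_FPN_3x.yaml \
        --image-ext mp4 \
        --output-dir /path/to/output \
        /path/to/input_videos

One .npz file per input video is written to the output directory.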
10 | """ 11 | 12 | import detectron2 13 | from detectron2.utils.logger import setup_logger 14 | from detectron2.config import get_cfg 15 | from detectron2 import model_zoo 16 | from detectron2.engine import DefaultPredictor 17 | 18 | import subprocess as sp 19 | import numpy as np 20 | import time 21 | import argparse 22 | import sys 23 | import os 24 | import glob 25 | 26 | def parse_args(): 27 | parser = argparse.ArgumentParser(description='End-to-end inference') 28 | parser.add_argument( 29 | '--cfg', 30 | dest='cfg', 31 | help='cfg model file (/path/to/model_config.yaml)', 32 | default=None, 33 | type=str 34 | ) 35 | parser.add_argument( 36 | '--output-dir', 37 | dest='output_dir', 38 | help='directory for visualization pdfs (default: /tmp/infer_simple)', 39 | default='/tmp/infer_simple', 40 | type=str 41 | ) 42 | parser.add_argument( 43 | '--image-ext', 44 | dest='image_ext', 45 | help='image file name extension (default: mp4)', 46 | default='mp4', 47 | type=str 48 | ) 49 | parser.add_argument( 50 | 'im_or_folder', help='image or folder of images', default=None 51 | ) 52 | if len(sys.argv) == 1: 53 | parser.print_help() 54 | sys.exit(1) 55 | return parser.parse_args() 56 | 57 | def get_resolution(filename): 58 | command = ['ffprobe', '-v', 'error', '-select_streams', 'v:0', 59 | '-show_entries', 'stream=width,height', '-of', 'csv=p=0', filename] 60 | pipe = sp.Popen(command, stdout=sp.PIPE, bufsize=-1) 61 | for line in pipe.stdout: 62 | w, h = line.decode().strip().split(',') 63 | return int(w), int(h) 64 | 65 | def read_video(filename): 66 | w, h = get_resolution(filename) 67 | 68 | command = ['ffmpeg', 69 | '-i', filename, 70 | '-f', 'image2pipe', 71 | '-pix_fmt', 'bgr24', 72 | '-vsync', '0', 73 | '-vcodec', 'rawvideo', '-'] 74 | 75 | pipe = sp.Popen(command, stdout=sp.PIPE, bufsize=-1) 76 | while True: 77 | data = pipe.stdout.read(w*h*3) 78 | if not data: 79 | break 80 | yield np.frombuffer(data, dtype='uint8').reshape((h, w, 3)) 81 | 82 | 83 | def main(args): 84 | 85 | cfg = get_cfg() 86 | cfg.merge_from_file(model_zoo.get_config_file(args.cfg)) 87 | cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7 88 | cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(args.cfg) 89 | predictor = DefaultPredictor(cfg) 90 | 91 | 92 | if os.path.isdir(args.im_or_folder): 93 | im_list = glob.iglob(args.im_or_folder + '/*.' 
+ args.image_ext) 94 | else: 95 | im_list = [args.im_or_folder] 96 | 97 | for video_name in im_list: 98 | out_name = os.path.join( 99 | args.output_dir, os.path.basename(video_name) 100 | ) 101 | print('Processing {}'.format(video_name)) 102 | 103 | boxes = [] 104 | segments = [] 105 | keypoints = [] 106 | 107 | for frame_i, im in enumerate(read_video(video_name)): 108 | t = time.time() 109 | outputs = predictor(im)['instances'].to('cpu') 110 | 111 | print('Frame {} processed in {:.3f}s'.format(frame_i, time.time() - t)) 112 | 113 | has_bbox = False 114 | if outputs.has('pred_boxes'): 115 | bbox_tensor = outputs.pred_boxes.tensor.numpy() 116 | if len(bbox_tensor) > 0: 117 | has_bbox = True 118 | scores = outputs.scores.numpy()[:, None] 119 | bbox_tensor = np.concatenate((bbox_tensor, scores), axis=1) 120 | if has_bbox: 121 | kps = outputs.pred_keypoints.numpy() 122 | kps_xy = kps[:, :, :2] 123 | kps_prob = kps[:, :, 2:3] 124 | kps_logit = np.zeros_like(kps_prob) # Dummy 125 | kps = np.concatenate((kps_xy, kps_logit, kps_prob), axis=2) 126 | kps = kps.transpose(0, 2, 1) 127 | else: 128 | kps = [] 129 | bbox_tensor = [] 130 | 131 | # Mimic Detectron1 format 132 | cls_boxes = [[], bbox_tensor] 133 | cls_keyps = [[], kps] 134 | 135 | boxes.append(cls_boxes) 136 | segments.append(None) 137 | keypoints.append(cls_keyps) 138 | 139 | 140 | # Video resolution 141 | metadata = { 142 | 'w': im.shape[1], 143 | 'h': im.shape[0], 144 | } 145 | 146 | np.savez_compressed(out_name, boxes=boxes, segments=segments, keypoints=keypoints, metadata=metadata) 147 | 148 | 149 | if __name__ == '__main__': 150 | setup_logger() 151 | args = parse_args() 152 | main(args) 153 | -------------------------------------------------------------------------------- /run.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 
6 | # 7 | 8 | import numpy as np 9 | 10 | from common.arguments import parse_args 11 | import torch 12 | 13 | import torch.nn as nn 14 | import torch.nn.functional as F 15 | import torch.optim as optim 16 | import os 17 | import sys 18 | import errno 19 | 20 | from common.camera import * 21 | from common.model import * 22 | from common.loss import * 23 | from common.generators import ChunkedGenerator, UnchunkedGenerator 24 | from time import time 25 | from common.utils import deterministic_random 26 | 27 | args = parse_args() 28 | print(args) 29 | 30 | try: 31 | # Create checkpoint directory if it does not exist 32 | os.makedirs(args.checkpoint) 33 | except OSError as e: 34 | if e.errno != errno.EEXIST: 35 | raise RuntimeError('Unable to create checkpoint directory:', args.checkpoint) 36 | 37 | print('Loading dataset...') 38 | dataset_path = 'data/data_3d_' + args.dataset + '.npz' 39 | if args.dataset == 'h36m': 40 | from common.h36m_dataset import Human36mDataset 41 | dataset = Human36mDataset(dataset_path) 42 | elif args.dataset.startswith('humaneva'): 43 | from common.humaneva_dataset import HumanEvaDataset 44 | dataset = HumanEvaDataset(dataset_path) 45 | elif args.dataset.startswith('custom'): 46 | from common.custom_dataset import CustomDataset 47 | dataset = CustomDataset('data/data_2d_' + args.dataset + '_' + args.keypoints + '.npz') 48 | else: 49 | raise KeyError('Invalid dataset') 50 | 51 | print('Preparing data...') 52 | for subject in dataset.subjects(): 53 | for action in dataset[subject].keys(): 54 | anim = dataset[subject][action] 55 | 56 | if 'positions' in anim: 57 | positions_3d = [] 58 | for cam in anim['cameras']: 59 | pos_3d = world_to_camera(anim['positions'], R=cam['orientation'], t=cam['translation']) 60 | pos_3d[:, 1:] -= pos_3d[:, :1] # Remove global offset, but keep trajectory in first position 61 | positions_3d.append(pos_3d) 62 | anim['positions_3d'] = positions_3d 63 | 64 | print('Loading 2D detections...') 65 | keypoints = np.load('data/data_2d_' + args.dataset + '_' + args.keypoints + '.npz', allow_pickle=True) 66 | keypoints_metadata = keypoints['metadata'].item() 67 | keypoints_symmetry = keypoints_metadata['keypoints_symmetry'] 68 | kps_left, kps_right = list(keypoints_symmetry[0]), list(keypoints_symmetry[1]) 69 | joints_left, joints_right = list(dataset.skeleton().joints_left()), list(dataset.skeleton().joints_right()) 70 | keypoints = keypoints['positions_2d'].item() 71 | 72 | for subject in dataset.subjects(): 73 | assert subject in keypoints, 'Subject {} is missing from the 2D detections dataset'.format(subject) 74 | for action in dataset[subject].keys(): 75 | assert action in keypoints[subject], 'Action {} of subject {} is missing from the 2D detections dataset'.format(action, subject) 76 | if 'positions_3d' not in dataset[subject][action]: 77 | continue 78 | 79 | for cam_idx in range(len(keypoints[subject][action])): 80 | 81 | # We check for >= instead of == because some videos in H3.6M contain extra frames 82 | mocap_length = dataset[subject][action]['positions_3d'][cam_idx].shape[0] 83 | assert keypoints[subject][action][cam_idx].shape[0] >= mocap_length 84 | 85 | if keypoints[subject][action][cam_idx].shape[0] > mocap_length: 86 | # Shorten sequence 87 | keypoints[subject][action][cam_idx] = keypoints[subject][action][cam_idx][:mocap_length] 88 | 89 | assert len(keypoints[subject][action]) == len(dataset[subject][action]['positions_3d']) 90 | 91 | for subject in keypoints.keys(): 92 | for action in keypoints[subject]: 93 | for cam_idx, kps in 
enumerate(keypoints[subject][action]): 94 | # Normalize camera frame 95 | cam = dataset.cameras()[subject][cam_idx] 96 | kps[..., :2] = normalize_screen_coordinates(kps[..., :2], w=cam['res_w'], h=cam['res_h']) 97 | keypoints[subject][action][cam_idx] = kps 98 | 99 | subjects_train = args.subjects_train.split(',') 100 | subjects_semi = [] if not args.subjects_unlabeled else args.subjects_unlabeled.split(',') 101 | if not args.render: 102 | subjects_test = args.subjects_test.split(',') 103 | else: 104 | subjects_test = [args.viz_subject] 105 | 106 | semi_supervised = len(subjects_semi) > 0 107 | if semi_supervised and not dataset.supports_semi_supervised(): 108 | raise RuntimeError('Semi-supervised training is not implemented for this dataset') 109 | 110 | def fetch(subjects, action_filter=None, subset=1, parse_3d_poses=True): 111 | out_poses_3d = [] 112 | out_poses_2d = [] 113 | out_camera_params = [] 114 | for subject in subjects: 115 | for action in keypoints[subject].keys(): 116 | if action_filter is not None: 117 | found = False 118 | for a in action_filter: 119 | if action.startswith(a): 120 | found = True 121 | break 122 | if not found: 123 | continue 124 | 125 | poses_2d = keypoints[subject][action] 126 | for i in range(len(poses_2d)): # Iterate across cameras 127 | out_poses_2d.append(poses_2d[i]) 128 | 129 | if subject in dataset.cameras(): 130 | cams = dataset.cameras()[subject] 131 | assert len(cams) == len(poses_2d), 'Camera count mismatch' 132 | for cam in cams: 133 | if 'intrinsic' in cam: 134 | out_camera_params.append(cam['intrinsic']) 135 | 136 | if parse_3d_poses and 'positions_3d' in dataset[subject][action]: 137 | poses_3d = dataset[subject][action]['positions_3d'] 138 | assert len(poses_3d) == len(poses_2d), 'Camera count mismatch' 139 | for i in range(len(poses_3d)): # Iterate across cameras 140 | out_poses_3d.append(poses_3d[i]) 141 | 142 | if len(out_camera_params) == 0: 143 | out_camera_params = None 144 | if len(out_poses_3d) == 0: 145 | out_poses_3d = None 146 | 147 | stride = args.downsample 148 | if subset < 1: 149 | for i in range(len(out_poses_2d)): 150 | n_frames = int(round(len(out_poses_2d[i])//stride * subset)*stride) 151 | start = deterministic_random(0, len(out_poses_2d[i]) - n_frames + 1, str(len(out_poses_2d[i]))) 152 | out_poses_2d[i] = out_poses_2d[i][start:start+n_frames:stride] 153 | if out_poses_3d is not None: 154 | out_poses_3d[i] = out_poses_3d[i][start:start+n_frames:stride] 155 | elif stride > 1: 156 | # Downsample as requested 157 | for i in range(len(out_poses_2d)): 158 | out_poses_2d[i] = out_poses_2d[i][::stride] 159 | if out_poses_3d is not None: 160 | out_poses_3d[i] = out_poses_3d[i][::stride] 161 | 162 | 163 | return out_camera_params, out_poses_3d, out_poses_2d 164 | 165 | action_filter = None if args.actions == '*' else args.actions.split(',') 166 | if action_filter is not None: 167 | print('Selected actions:', action_filter) 168 | 169 | cameras_valid, poses_valid, poses_valid_2d = fetch(subjects_test, action_filter) 170 | 171 | filter_widths = [int(x) for x in args.architecture.split(',')] 172 | if not args.disable_optimizations and not args.dense and args.stride == 1: 173 | # Use optimized model for single-frame predictions 174 | model_pos_train = TemporalModelOptimized1f(poses_valid_2d[0].shape[-2], poses_valid_2d[0].shape[-1], dataset.skeleton().num_joints(), 175 | filter_widths=filter_widths, causal=args.causal, dropout=args.dropout, channels=args.channels) 176 | else: 177 | # When incompatible settings are detected (stride > 
1, dense filters, or disabled optimization) fall back to normal model 178 | model_pos_train = TemporalModel(poses_valid_2d[0].shape[-2], poses_valid_2d[0].shape[-1], dataset.skeleton().num_joints(), 179 | filter_widths=filter_widths, causal=args.causal, dropout=args.dropout, channels=args.channels, 180 | dense=args.dense) 181 | 182 | model_pos = TemporalModel(poses_valid_2d[0].shape[-2], poses_valid_2d[0].shape[-1], dataset.skeleton().num_joints(), 183 | filter_widths=filter_widths, causal=args.causal, dropout=args.dropout, channels=args.channels, 184 | dense=args.dense) 185 | 186 | receptive_field = model_pos.receptive_field() 187 | print('INFO: Receptive field: {} frames'.format(receptive_field)) 188 | pad = (receptive_field - 1) // 2 # Padding on each side 189 | if args.causal: 190 | print('INFO: Using causal convolutions') 191 | causal_shift = pad 192 | else: 193 | causal_shift = 0 194 | 195 | model_params = 0 196 | for parameter in model_pos.parameters(): 197 | model_params += parameter.numel() 198 | print('INFO: Trainable parameter count:', model_params) 199 | 200 | if torch.cuda.is_available(): 201 | model_pos = model_pos.cuda() 202 | model_pos_train = model_pos_train.cuda() 203 | 204 | if args.resume or args.evaluate: 205 | chk_filename = os.path.join(args.checkpoint, args.resume if args.resume else args.evaluate) 206 | print('Loading checkpoint', chk_filename) 207 | checkpoint = torch.load(chk_filename, map_location=lambda storage, loc: storage) 208 | print('This model was trained for {} epochs'.format(checkpoint['epoch'])) 209 | model_pos_train.load_state_dict(checkpoint['model_pos']) 210 | model_pos.load_state_dict(checkpoint['model_pos']) 211 | 212 | if args.evaluate and 'model_traj' in checkpoint: 213 | # Load trajectory model if it is contained in the checkpoint (e.g.
for inference in the wild) 214 | model_traj = TemporalModel(poses_valid_2d[0].shape[-2], poses_valid_2d[0].shape[-1], 1, 215 | filter_widths=filter_widths, causal=args.causal, dropout=args.dropout, channels=args.channels, 216 | dense=args.dense) 217 | if torch.cuda.is_available(): 218 | model_traj = model_traj.cuda() 219 | model_traj.load_state_dict(checkpoint['model_traj']) 220 | else: 221 | model_traj = None 222 | 223 | 224 | test_generator = UnchunkedGenerator(cameras_valid, poses_valid, poses_valid_2d, 225 | pad=pad, causal_shift=causal_shift, augment=False, 226 | kps_left=kps_left, kps_right=kps_right, joints_left=joints_left, joints_right=joints_right) 227 | print('INFO: Testing on {} frames'.format(test_generator.num_frames())) 228 | 229 | if not args.evaluate: 230 | cameras_train, poses_train, poses_train_2d = fetch(subjects_train, action_filter, subset=args.subset) 231 | 232 | lr = args.learning_rate 233 | if semi_supervised: 234 | cameras_semi, _, poses_semi_2d = fetch(subjects_semi, action_filter, parse_3d_poses=False) 235 | 236 | if not args.disable_optimizations and not args.dense and args.stride == 1: 237 | # Use optimized model for single-frame predictions 238 | model_traj_train = TemporalModelOptimized1f(poses_valid_2d[0].shape[-2], poses_valid_2d[0].shape[-1], 1, 239 | filter_widths=filter_widths, causal=args.causal, dropout=args.dropout, channels=args.channels) 240 | else: 241 | # When incompatible settings are detected (stride > 1, dense filters, or disabled optimization) fall back to normal model 242 | model_traj_train = TemporalModel(poses_valid_2d[0].shape[-2], poses_valid_2d[0].shape[-1], 1, 243 | filter_widths=filter_widths, causal=args.causal, dropout=args.dropout, channels=args.channels, 244 | dense=args.dense) 245 | 246 | model_traj = TemporalModel(poses_valid_2d[0].shape[-2], poses_valid_2d[0].shape[-1], 1, 247 | filter_widths=filter_widths, causal=args.causal, dropout=args.dropout, channels=args.channels, 248 | dense=args.dense) 249 | if torch.cuda.is_available(): 250 | model_traj = model_traj.cuda() 251 | model_traj_train = model_traj_train.cuda() 252 | optimizer = optim.Adam(list(model_pos_train.parameters()) + list(model_traj_train.parameters()), 253 | lr=lr, amsgrad=True) 254 | 255 | losses_2d_train_unlabeled = [] 256 | losses_2d_train_labeled_eval = [] 257 | losses_2d_train_unlabeled_eval = [] 258 | losses_2d_valid = [] 259 | 260 | losses_traj_train = [] 261 | losses_traj_train_eval = [] 262 | losses_traj_valid = [] 263 | else: 264 | optimizer = optim.Adam(model_pos_train.parameters(), lr=lr, amsgrad=True) 265 | 266 | lr_decay = args.lr_decay 267 | 268 | losses_3d_train = [] 269 | losses_3d_train_eval = [] 270 | losses_3d_valid = [] 271 | 272 | epoch = 0 273 | initial_momentum = 0.1 274 | final_momentum = 0.001 275 | 276 | 277 | train_generator = ChunkedGenerator(args.batch_size//args.stride, cameras_train, poses_train, poses_train_2d, args.stride, 278 | pad=pad, causal_shift=causal_shift, shuffle=True, augment=args.data_augmentation, 279 | kps_left=kps_left, kps_right=kps_right, joints_left=joints_left, joints_right=joints_right) 280 | train_generator_eval = UnchunkedGenerator(cameras_train, poses_train, poses_train_2d, 281 | pad=pad, causal_shift=causal_shift, augment=False) 282 | print('INFO: Training on {} frames'.format(train_generator_eval.num_frames())) 283 | if semi_supervised: 284 | semi_generator = ChunkedGenerator(args.batch_size//args.stride, cameras_semi, None, poses_semi_2d, args.stride, 285 | pad=pad, causal_shift=causal_shift, shuffle=True, 
286 | random_seed=4321, augment=args.data_augmentation, 287 | kps_left=kps_left, kps_right=kps_right, joints_left=joints_left, joints_right=joints_right, 288 | endless=True) 289 | semi_generator_eval = UnchunkedGenerator(cameras_semi, None, poses_semi_2d, 290 | pad=pad, causal_shift=causal_shift, augment=False) 291 | print('INFO: Semi-supervision on {} frames'.format(semi_generator_eval.num_frames())) 292 | 293 | if args.resume: 294 | epoch = checkpoint['epoch'] 295 | if 'optimizer' in checkpoint and checkpoint['optimizer'] is not None: 296 | optimizer.load_state_dict(checkpoint['optimizer']) 297 | train_generator.set_random_state(checkpoint['random_state']) 298 | else: 299 | print('WARNING: this checkpoint does not contain an optimizer state. The optimizer will be reinitialized.') 300 | 301 | lr = checkpoint['lr'] 302 | if semi_supervised: 303 | model_traj_train.load_state_dict(checkpoint['model_traj']) 304 | model_traj.load_state_dict(checkpoint['model_traj']) 305 | semi_generator.set_random_state(checkpoint['random_state_semi']) 306 | 307 | print('** Note: reported losses are averaged over all frames and test-time augmentation is not used here.') 308 | print('** The final evaluation will be carried out after the last training epoch.') 309 | 310 | # Pos model only 311 | while epoch < args.epochs: 312 | start_time = time() 313 | epoch_loss_3d_train = 0 314 | epoch_loss_traj_train = 0 315 | epoch_loss_2d_train_unlabeled = 0 316 | N = 0 317 | N_semi = 0 318 | model_pos_train.train() 319 | if semi_supervised: 320 | # Semi-supervised scenario 321 | model_traj_train.train() 322 | for (_, batch_3d, batch_2d), (cam_semi, _, batch_2d_semi) in \ 323 | zip(train_generator.next_epoch(), semi_generator.next_epoch()): 324 | 325 | # Fall back to supervised training for the first epoch (to avoid instability) 326 | skip = epoch < args.warmup 327 | 328 | cam_semi = torch.from_numpy(cam_semi.astype('float32')) 329 | inputs_3d = torch.from_numpy(batch_3d.astype('float32')) 330 | if torch.cuda.is_available(): 331 | cam_semi = cam_semi.cuda() 332 | inputs_3d = inputs_3d.cuda() 333 | 334 | inputs_traj = inputs_3d[:, :, :1].clone() 335 | inputs_3d[:, :, 0] = 0 336 | 337 | # Split point between labeled and unlabeled samples in the batch 338 | split_idx = inputs_3d.shape[0] 339 | 340 | inputs_2d = torch.from_numpy(batch_2d.astype('float32')) 341 | inputs_2d_semi = torch.from_numpy(batch_2d_semi.astype('float32')) 342 | if torch.cuda.is_available(): 343 | inputs_2d = inputs_2d.cuda() 344 | inputs_2d_semi = inputs_2d_semi.cuda() 345 | inputs_2d_cat = torch.cat((inputs_2d, inputs_2d_semi), dim=0) if not skip else inputs_2d 346 | 347 | optimizer.zero_grad() 348 | 349 | # Compute 3D poses 350 | predicted_3d_pos_cat = model_pos_train(inputs_2d_cat) 351 | 352 | loss_3d_pos = mpjpe(predicted_3d_pos_cat[:split_idx], inputs_3d) 353 | epoch_loss_3d_train += inputs_3d.shape[0]*inputs_3d.shape[1] * loss_3d_pos.item() 354 | N += inputs_3d.shape[0]*inputs_3d.shape[1] 355 | loss_total = loss_3d_pos 356 | 357 | # Compute global trajectory 358 | predicted_traj_cat = model_traj_train(inputs_2d_cat) 359 | w = 1 / inputs_traj[:, :, :, 2] # Weight inversely proportional to depth 360 | loss_traj = weighted_mpjpe(predicted_traj_cat[:split_idx], inputs_traj, w) 361 | epoch_loss_traj_train += inputs_3d.shape[0]*inputs_3d.shape[1] * loss_traj.item() 362 | assert inputs_traj.shape[0]*inputs_traj.shape[1] == inputs_3d.shape[0]*inputs_3d.shape[1] 363 | loss_total += loss_traj 364 | 365 | if not skip: 366 | # Semi-supervised loss for unlabeled 
samples 367 | predicted_semi = predicted_3d_pos_cat[split_idx:] 368 | if pad > 0: 369 | target_semi = inputs_2d_semi[:, pad:-pad, :, :2].contiguous() 370 | else: 371 | target_semi = inputs_2d_semi[:, :, :, :2].contiguous() 372 | 373 | projection_func = project_to_2d_linear if args.linear_projection else project_to_2d 374 | reconstruction_semi = projection_func(predicted_semi + predicted_traj_cat[split_idx:], cam_semi) 375 | 376 | loss_reconstruction = mpjpe(reconstruction_semi, target_semi) # On 2D poses 377 | epoch_loss_2d_train_unlabeled += predicted_semi.shape[0]*predicted_semi.shape[1] * loss_reconstruction.item() 378 | if not args.no_proj: 379 | loss_total += loss_reconstruction 380 | 381 | # Bone length term to enforce kinematic constraints 382 | if args.bone_length_term: 383 | dists = predicted_3d_pos_cat[:, :, 1:] - predicted_3d_pos_cat[:, :, dataset.skeleton().parents()[1:]] 384 | bone_lengths = torch.mean(torch.norm(dists, dim=3), dim=1) 385 | penalty = torch.mean(torch.abs(torch.mean(bone_lengths[:split_idx], dim=0) \ 386 | - torch.mean(bone_lengths[split_idx:], dim=0))) 387 | loss_total += penalty 388 | 389 | 390 | N_semi += predicted_semi.shape[0]*predicted_semi.shape[1] 391 | else: 392 | N_semi += 1 # To avoid division by zero 393 | 394 | loss_total.backward() 395 | 396 | optimizer.step() 397 | losses_traj_train.append(epoch_loss_traj_train / N) 398 | losses_2d_train_unlabeled.append(epoch_loss_2d_train_unlabeled / N_semi) 399 | else: 400 | # Regular supervised scenario 401 | for _, batch_3d, batch_2d in train_generator.next_epoch(): 402 | inputs_3d = torch.from_numpy(batch_3d.astype('float32')) 403 | inputs_2d = torch.from_numpy(batch_2d.astype('float32')) 404 | if torch.cuda.is_available(): 405 | inputs_3d = inputs_3d.cuda() 406 | inputs_2d = inputs_2d.cuda() 407 | inputs_3d[:, :, 0] = 0 408 | 409 | optimizer.zero_grad() 410 | 411 | # Predict 3D poses 412 | predicted_3d_pos = model_pos_train(inputs_2d) 413 | loss_3d_pos = mpjpe(predicted_3d_pos, inputs_3d) 414 | epoch_loss_3d_train += inputs_3d.shape[0]*inputs_3d.shape[1] * loss_3d_pos.item() 415 | N += inputs_3d.shape[0]*inputs_3d.shape[1] 416 | 417 | loss_total = loss_3d_pos 418 | loss_total.backward() 419 | 420 | optimizer.step() 421 | 422 | losses_3d_train.append(epoch_loss_3d_train / N) 423 | 424 | # End-of-epoch evaluation 425 | with torch.no_grad(): 426 | model_pos.load_state_dict(model_pos_train.state_dict()) 427 | model_pos.eval() 428 | if semi_supervised: 429 | model_traj.load_state_dict(model_traj_train.state_dict()) 430 | model_traj.eval() 431 | 432 | epoch_loss_3d_valid = 0 433 | epoch_loss_traj_valid = 0 434 | epoch_loss_2d_valid = 0 435 | N = 0 436 | 437 | if not args.no_eval: 438 | # Evaluate on test set 439 | for cam, batch, batch_2d in test_generator.next_epoch(): 440 | inputs_3d = torch.from_numpy(batch.astype('float32')) 441 | inputs_2d = torch.from_numpy(batch_2d.astype('float32')) 442 | if torch.cuda.is_available(): 443 | inputs_3d = inputs_3d.cuda() 444 | inputs_2d = inputs_2d.cuda() 445 | inputs_traj = inputs_3d[:, :, :1].clone() 446 | inputs_3d[:, :, 0] = 0 447 | 448 | # Predict 3D poses 449 | predicted_3d_pos = model_pos(inputs_2d) 450 | loss_3d_pos = mpjpe(predicted_3d_pos, inputs_3d) 451 | epoch_loss_3d_valid += inputs_3d.shape[0]*inputs_3d.shape[1] * loss_3d_pos.item() 452 | N += inputs_3d.shape[0]*inputs_3d.shape[1] 453 | 454 | if semi_supervised: 455 | cam = torch.from_numpy(cam.astype('float32')) 456 | if torch.cuda.is_available(): 457 | cam = cam.cuda() 458 | 459 | predicted_traj = 
model_traj(inputs_2d) 460 | loss_traj = mpjpe(predicted_traj, inputs_traj) 461 | epoch_loss_traj_valid += inputs_traj.shape[0]*inputs_traj.shape[1] * loss_traj.item() 462 | assert inputs_traj.shape[0]*inputs_traj.shape[1] == inputs_3d.shape[0]*inputs_3d.shape[1] 463 | 464 | if pad > 0: 465 | target = inputs_2d[:, pad:-pad, :, :2].contiguous() 466 | else: 467 | target = inputs_2d[:, :, :, :2].contiguous() 468 | reconstruction = project_to_2d(predicted_3d_pos + predicted_traj, cam) 469 | loss_reconstruction = mpjpe(reconstruction, target) # On 2D poses 470 | epoch_loss_2d_valid += reconstruction.shape[0]*reconstruction.shape[1] * loss_reconstruction.item() 471 | assert reconstruction.shape[0]*reconstruction.shape[1] == inputs_3d.shape[0]*inputs_3d.shape[1] 472 | 473 | losses_3d_valid.append(epoch_loss_3d_valid / N) 474 | if semi_supervised: 475 | losses_traj_valid.append(epoch_loss_traj_valid / N) 476 | losses_2d_valid.append(epoch_loss_2d_valid / N) 477 | 478 | 479 | # Evaluate on training set, this time in evaluation mode 480 | epoch_loss_3d_train_eval = 0 481 | epoch_loss_traj_train_eval = 0 482 | epoch_loss_2d_train_labeled_eval = 0 483 | N = 0 484 | for cam, batch, batch_2d in train_generator_eval.next_epoch(): 485 | if batch_2d.shape[1] == 0: 486 | # This can only happen when downsampling the dataset 487 | continue 488 | 489 | inputs_3d = torch.from_numpy(batch.astype('float32')) 490 | inputs_2d = torch.from_numpy(batch_2d.astype('float32')) 491 | if torch.cuda.is_available(): 492 | inputs_3d = inputs_3d.cuda() 493 | inputs_2d = inputs_2d.cuda() 494 | inputs_traj = inputs_3d[:, :, :1].clone() 495 | inputs_3d[:, :, 0] = 0 496 | 497 | # Compute 3D poses 498 | predicted_3d_pos = model_pos(inputs_2d) 499 | loss_3d_pos = mpjpe(predicted_3d_pos, inputs_3d) 500 | epoch_loss_3d_train_eval += inputs_3d.shape[0]*inputs_3d.shape[1] * loss_3d_pos.item() 501 | N += inputs_3d.shape[0]*inputs_3d.shape[1] 502 | 503 | if semi_supervised: 504 | cam = torch.from_numpy(cam.astype('float32')) 505 | if torch.cuda.is_available(): 506 | cam = cam.cuda() 507 | predicted_traj = model_traj(inputs_2d) 508 | loss_traj = mpjpe(predicted_traj, inputs_traj) 509 | epoch_loss_traj_train_eval += inputs_traj.shape[0]*inputs_traj.shape[1] * loss_traj.item() 510 | assert inputs_traj.shape[0]*inputs_traj.shape[1] == inputs_3d.shape[0]*inputs_3d.shape[1] 511 | 512 | if pad > 0: 513 | target = inputs_2d[:, pad:-pad, :, :2].contiguous() 514 | else: 515 | target = inputs_2d[:, :, :, :2].contiguous() 516 | reconstruction = project_to_2d(predicted_3d_pos + predicted_traj, cam) 517 | loss_reconstruction = mpjpe(reconstruction, target) 518 | epoch_loss_2d_train_labeled_eval += reconstruction.shape[0]*reconstruction.shape[1] * loss_reconstruction.item() 519 | assert reconstruction.shape[0]*reconstruction.shape[1] == inputs_3d.shape[0]*inputs_3d.shape[1] 520 | 521 | losses_3d_train_eval.append(epoch_loss_3d_train_eval / N) 522 | if semi_supervised: 523 | losses_traj_train_eval.append(epoch_loss_traj_train_eval / N) 524 | losses_2d_train_labeled_eval.append(epoch_loss_2d_train_labeled_eval / N) 525 | 526 | # Evaluate 2D loss on unlabeled training set (in evaluation mode) 527 | epoch_loss_2d_train_unlabeled_eval = 0 528 | N_semi = 0 529 | if semi_supervised: 530 | for cam, _, batch_2d in semi_generator_eval.next_epoch(): 531 | cam = torch.from_numpy(cam.astype('float32')) 532 | inputs_2d_semi = torch.from_numpy(batch_2d.astype('float32')) 533 | if torch.cuda.is_available(): 534 | cam = cam.cuda() 535 | inputs_2d_semi = 
inputs_2d_semi.cuda() 536 | 537 | predicted_3d_pos_semi = model_pos(inputs_2d_semi) 538 | predicted_traj_semi = model_traj(inputs_2d_semi) 539 | if pad > 0: 540 | target_semi = inputs_2d_semi[:, pad:-pad, :, :2].contiguous() 541 | else: 542 | target_semi = inputs_2d_semi[:, :, :, :2].contiguous() 543 | reconstruction_semi = project_to_2d(predicted_3d_pos_semi + predicted_traj_semi, cam) 544 | loss_reconstruction_semi = mpjpe(reconstruction_semi, target_semi) 545 | 546 | epoch_loss_2d_train_unlabeled_eval += reconstruction_semi.shape[0]*reconstruction_semi.shape[1] \ 547 | * loss_reconstruction_semi.item() 548 | N_semi += reconstruction_semi.shape[0]*reconstruction_semi.shape[1] 549 | losses_2d_train_unlabeled_eval.append(epoch_loss_2d_train_unlabeled_eval / N_semi) 550 | 551 | elapsed = (time() - start_time)/60 552 | 553 | if args.no_eval: 554 | print('[%d] time %.2f lr %f 3d_train %f' % ( 555 | epoch + 1, 556 | elapsed, 557 | lr, 558 | losses_3d_train[-1] * 1000)) 559 | else: 560 | if semi_supervised: 561 | print('[%d] time %.2f lr %f 3d_train %f 3d_eval %f traj_eval %f 3d_valid %f ' 562 | 'traj_valid %f 2d_train_sup %f 2d_train_unsup %f 2d_valid %f' % ( 563 | epoch + 1, 564 | elapsed, 565 | lr, 566 | losses_3d_train[-1] * 1000, 567 | losses_3d_train_eval[-1] * 1000, 568 | losses_traj_train_eval[-1] * 1000, 569 | losses_3d_valid[-1] * 1000, 570 | losses_traj_valid[-1] * 1000, 571 | losses_2d_train_labeled_eval[-1], 572 | losses_2d_train_unlabeled_eval[-1], 573 | losses_2d_valid[-1])) 574 | else: 575 | print('[%d] time %.2f lr %f 3d_train %f 3d_eval %f 3d_valid %f' % ( 576 | epoch + 1, 577 | elapsed, 578 | lr, 579 | losses_3d_train[-1] * 1000, 580 | losses_3d_train_eval[-1] * 1000, 581 | losses_3d_valid[-1] *1000)) 582 | 583 | # Decay learning rate exponentially 584 | lr *= lr_decay 585 | for param_group in optimizer.param_groups: 586 | param_group['lr'] *= lr_decay 587 | epoch += 1 588 | 589 | # Decay BatchNorm momentum 590 | momentum = initial_momentum * np.exp(-epoch/args.epochs * np.log(initial_momentum/final_momentum)) 591 | model_pos_train.set_bn_momentum(momentum) 592 | if semi_supervised: 593 | model_traj_train.set_bn_momentum(momentum) 594 | 595 | # Save checkpoint if necessary 596 | if epoch % args.checkpoint_frequency == 0: 597 | chk_path = os.path.join(args.checkpoint, 'epoch_{}.bin'.format(epoch)) 598 | print('Saving checkpoint to', chk_path) 599 | 600 | torch.save({ 601 | 'epoch': epoch, 602 | 'lr': lr, 603 | 'random_state': train_generator.random_state(), 604 | 'optimizer': optimizer.state_dict(), 605 | 'model_pos': model_pos_train.state_dict(), 606 | 'model_traj': model_traj_train.state_dict() if semi_supervised else None, 607 | 'random_state_semi': semi_generator.random_state() if semi_supervised else None, 608 | }, chk_path) 609 | 610 | # Save training curves after every epoch, as .png images (if requested) 611 | if args.export_training_curves and epoch > 3: 612 | if 'matplotlib' not in sys.modules: 613 | import matplotlib 614 | matplotlib.use('Agg') 615 | import matplotlib.pyplot as plt 616 | 617 | plt.figure() 618 | epoch_x = np.arange(3, len(losses_3d_train)) + 1 619 | plt.plot(epoch_x, losses_3d_train[3:], '--', color='C0') 620 | plt.plot(epoch_x, losses_3d_train_eval[3:], color='C0') 621 | plt.plot(epoch_x, losses_3d_valid[3:], color='C1') 622 | plt.legend(['3d train', '3d train (eval)', '3d valid (eval)']) 623 | plt.ylabel('MPJPE (m)') 624 | plt.xlabel('Epoch') 625 | plt.xlim((3, epoch)) 626 | plt.savefig(os.path.join(args.checkpoint, 'loss_3d.png')) 627 | 628 | if 
semi_supervised: 629 | plt.figure() 630 | plt.plot(epoch_x, losses_traj_train[3:], '--', color='C0') 631 | plt.plot(epoch_x, losses_traj_train_eval[3:], color='C0') 632 | plt.plot(epoch_x, losses_traj_valid[3:], color='C1') 633 | plt.legend(['traj. train', 'traj. train (eval)', 'traj. valid (eval)']) 634 | plt.ylabel('Mean distance (m)') 635 | plt.xlabel('Epoch') 636 | plt.xlim((3, epoch)) 637 | plt.savefig(os.path.join(args.checkpoint, 'loss_traj.png')) 638 | 639 | plt.figure() 640 | plt.plot(epoch_x, losses_2d_train_labeled_eval[3:], color='C0') 641 | plt.plot(epoch_x, losses_2d_train_unlabeled[3:], '--', color='C1') 642 | plt.plot(epoch_x, losses_2d_train_unlabeled_eval[3:], color='C1') 643 | plt.plot(epoch_x, losses_2d_valid[3:], color='C2') 644 | plt.legend(['2d train labeled (eval)', '2d train unlabeled', '2d train unlabeled (eval)', '2d valid (eval)']) 645 | plt.ylabel('MPJPE (2D)') 646 | plt.xlabel('Epoch') 647 | plt.xlim((3, epoch)) 648 | plt.savefig(os.path.join(args.checkpoint, 'loss_2d.png')) 649 | plt.close('all') 650 | 651 | # Evaluate 652 | def evaluate(test_generator, action=None, return_predictions=False, use_trajectory_model=False): 653 | epoch_loss_3d_pos = 0 654 | epoch_loss_3d_pos_procrustes = 0 655 | epoch_loss_3d_pos_scale = 0 656 | epoch_loss_3d_vel = 0 657 | with torch.no_grad(): 658 | if not use_trajectory_model: 659 | model_pos.eval() 660 | else: 661 | model_traj.eval() 662 | N = 0 663 | for _, batch, batch_2d in test_generator.next_epoch(): 664 | inputs_2d = torch.from_numpy(batch_2d.astype('float32')) 665 | if torch.cuda.is_available(): 666 | inputs_2d = inputs_2d.cuda() 667 | 668 | # Positional model 669 | if not use_trajectory_model: 670 | predicted_3d_pos = model_pos(inputs_2d) 671 | else: 672 | predicted_3d_pos = model_traj(inputs_2d) 673 | 674 | # Test-time augmentation (if enabled) 675 | if test_generator.augment_enabled(): 676 | # Undo flipping and take average with non-flipped version 677 | predicted_3d_pos[1, :, :, 0] *= -1 678 | if not use_trajectory_model: 679 | predicted_3d_pos[1, :, joints_left + joints_right] = predicted_3d_pos[1, :, joints_right + joints_left] 680 | predicted_3d_pos = torch.mean(predicted_3d_pos, dim=0, keepdim=True) 681 | 682 | if return_predictions: 683 | return predicted_3d_pos.squeeze(0).cpu().numpy() 684 | 685 | inputs_3d = torch.from_numpy(batch.astype('float32')) 686 | if torch.cuda.is_available(): 687 | inputs_3d = inputs_3d.cuda() 688 | inputs_3d[:, :, 0] = 0 689 | if test_generator.augment_enabled(): 690 | inputs_3d = inputs_3d[:1] 691 | 692 | error = mpjpe(predicted_3d_pos, inputs_3d) 693 | epoch_loss_3d_pos_scale += inputs_3d.shape[0]*inputs_3d.shape[1] * n_mpjpe(predicted_3d_pos, inputs_3d).item() 694 | 695 | epoch_loss_3d_pos += inputs_3d.shape[0]*inputs_3d.shape[1] * error.item() 696 | N += inputs_3d.shape[0] * inputs_3d.shape[1] 697 | 698 | inputs = inputs_3d.cpu().numpy().reshape(-1, inputs_3d.shape[-2], inputs_3d.shape[-1]) 699 | predicted_3d_pos = predicted_3d_pos.cpu().numpy().reshape(-1, inputs_3d.shape[-2], inputs_3d.shape[-1]) 700 | 701 | epoch_loss_3d_pos_procrustes += inputs_3d.shape[0]*inputs_3d.shape[1] * p_mpjpe(predicted_3d_pos, inputs) 702 | 703 | # Compute velocity error 704 | epoch_loss_3d_vel += inputs_3d.shape[0]*inputs_3d.shape[1] * mean_velocity_error(predicted_3d_pos, inputs) 705 | 706 | if action is None: 707 | print('----------') 708 | else: 709 | print('----'+action+'----') 710 | e1 = (epoch_loss_3d_pos / N)*1000 711 | e2 = (epoch_loss_3d_pos_procrustes / N)*1000 712 | e3 = 
(epoch_loss_3d_pos_scale / N)*1000 713 | ev = (epoch_loss_3d_vel / N)*1000 714 | print('Test time augmentation:', test_generator.augment_enabled()) 715 | print('Protocol #1 Error (MPJPE):', e1, 'mm') 716 | print('Protocol #2 Error (P-MPJPE):', e2, 'mm') 717 | print('Protocol #3 Error (N-MPJPE):', e3, 'mm') 718 | print('Velocity Error (MPJVE):', ev, 'mm') 719 | print('----------') 720 | 721 | return e1, e2, e3, ev 722 | 723 | 724 | if args.render: 725 | print('Rendering...') 726 | 727 | input_keypoints = keypoints[args.viz_subject][args.viz_action][args.viz_camera].copy() 728 | ground_truth = None 729 | if args.viz_subject in dataset.subjects() and args.viz_action in dataset[args.viz_subject]: 730 | if 'positions_3d' in dataset[args.viz_subject][args.viz_action]: 731 | ground_truth = dataset[args.viz_subject][args.viz_action]['positions_3d'][args.viz_camera].copy() 732 | if ground_truth is None: 733 | print('INFO: this action is unlabeled. Ground truth will not be rendered.') 734 | 735 | gen = UnchunkedGenerator(None, None, [input_keypoints], 736 | pad=pad, causal_shift=causal_shift, augment=args.test_time_augmentation, 737 | kps_left=kps_left, kps_right=kps_right, joints_left=joints_left, joints_right=joints_right) 738 | prediction = evaluate(gen, return_predictions=True) 739 | if model_traj is not None and ground_truth is None: 740 | prediction_traj = evaluate(gen, return_predictions=True, use_trajectory_model=True) 741 | prediction += prediction_traj 742 | 743 | if args.viz_export is not None: 744 | print('Exporting joint positions to', args.viz_export) 745 | # Predictions are in camera space 746 | np.save(args.viz_export, prediction) 747 | 748 | if args.viz_output is not None: 749 | if ground_truth is not None: 750 | # Reapply trajectory 751 | trajectory = ground_truth[:, :1] 752 | ground_truth[:, 1:] += trajectory 753 | prediction += trajectory 754 | 755 | # Invert camera transformation 756 | cam = dataset.cameras()[args.viz_subject][args.viz_camera] 757 | if ground_truth is not None: 758 | prediction = camera_to_world(prediction, R=cam['orientation'], t=cam['translation']) 759 | ground_truth = camera_to_world(ground_truth, R=cam['orientation'], t=cam['translation']) 760 | else: 761 | # If the ground truth is not available, take the camera extrinsic params from a random subject. 762 | # They are almost the same, and anyway, we only need this for visualization purposes. 
763 | for subject in dataset.cameras(): 764 | if 'orientation' in dataset.cameras()[subject][args.viz_camera]: 765 | rot = dataset.cameras()[subject][args.viz_camera]['orientation'] 766 | break 767 | prediction = camera_to_world(prediction, R=rot, t=0) 768 | # We don't have the trajectory, but at least we can rebase the height 769 | prediction[:, :, 2] -= np.min(prediction[:, :, 2]) 770 | 771 | anim_output = {'Reconstruction': prediction} 772 | if ground_truth is not None and not args.viz_no_ground_truth: 773 | anim_output['Ground truth'] = ground_truth 774 | 775 | input_keypoints = image_coordinates(input_keypoints[..., :2], w=cam['res_w'], h=cam['res_h']) 776 | 777 | from common.visualization import render_animation 778 | render_animation(input_keypoints, keypoints_metadata, anim_output, 779 | dataset.skeleton(), dataset.fps(), args.viz_bitrate, cam['azimuth'], args.viz_output, 780 | limit=args.viz_limit, downsample=args.viz_downsample, size=args.viz_size, 781 | input_video_path=args.viz_video, viewport=(cam['res_w'], cam['res_h']), 782 | input_video_skip=args.viz_skip) 783 | 784 | else: 785 | print('Evaluating...') 786 | all_actions = {} 787 | all_actions_by_subject = {} 788 | for subject in subjects_test: 789 | if subject not in all_actions_by_subject: 790 | all_actions_by_subject[subject] = {} 791 | 792 | for action in dataset[subject].keys(): 793 | action_name = action.split(' ')[0] 794 | if action_name not in all_actions: 795 | all_actions[action_name] = [] 796 | if action_name not in all_actions_by_subject[subject]: 797 | all_actions_by_subject[subject][action_name] = [] 798 | all_actions[action_name].append((subject, action)) 799 | all_actions_by_subject[subject][action_name].append((subject, action)) 800 | 801 | def fetch_actions(actions): 802 | out_poses_3d = [] 803 | out_poses_2d = [] 804 | 805 | for subject, action in actions: 806 | poses_2d = keypoints[subject][action] 807 | for i in range(len(poses_2d)): # Iterate across cameras 808 | out_poses_2d.append(poses_2d[i]) 809 | 810 | poses_3d = dataset[subject][action]['positions_3d'] 811 | assert len(poses_3d) == len(poses_2d), 'Camera count mismatch' 812 | for i in range(len(poses_3d)): # Iterate across cameras 813 | out_poses_3d.append(poses_3d[i]) 814 | 815 | stride = args.downsample 816 | if stride > 1: 817 | # Downsample as requested 818 | for i in range(len(out_poses_2d)): 819 | out_poses_2d[i] = out_poses_2d[i][::stride] 820 | if out_poses_3d is not None: 821 | out_poses_3d[i] = out_poses_3d[i][::stride] 822 | 823 | return out_poses_3d, out_poses_2d 824 | 825 | def run_evaluation(actions, action_filter=None): 826 | errors_p1 = [] 827 | errors_p2 = [] 828 | errors_p3 = [] 829 | errors_vel = [] 830 | 831 | for action_key in actions.keys(): 832 | if action_filter is not None: 833 | found = False 834 | for a in action_filter: 835 | if action_key.startswith(a): 836 | found = True 837 | break 838 | if not found: 839 | continue 840 | 841 | poses_act, poses_2d_act = fetch_actions(actions[action_key]) 842 | gen = UnchunkedGenerator(None, poses_act, poses_2d_act, 843 | pad=pad, causal_shift=causal_shift, augment=args.test_time_augmentation, 844 | kps_left=kps_left, kps_right=kps_right, joints_left=joints_left, joints_right=joints_right) 845 | e1, e2, e3, ev = evaluate(gen, action_key) 846 | errors_p1.append(e1) 847 | errors_p2.append(e2) 848 | errors_p3.append(e3) 849 | errors_vel.append(ev) 850 | 851 | print('Protocol #1 (MPJPE) action-wise average:', round(np.mean(errors_p1), 1), 'mm') 852 | print('Protocol #2 (P-MPJPE) 
action-wise average:', round(np.mean(errors_p2), 1), 'mm') 853 | print('Protocol #3 (N-MPJPE) action-wise average:', round(np.mean(errors_p3), 1), 'mm') 854 | print('Velocity (MPJVE) action-wise average:', round(np.mean(errors_vel), 2), 'mm') 855 | 856 | if not args.by_subject: 857 | run_evaluation(all_actions, action_filter) 858 | else: 859 | for subject in all_actions_by_subject.keys(): 860 | print('Evaluating on subject', subject) 861 | run_evaluation(all_actions_by_subject[subject], action_filter) 862 | print('') --------------------------------------------------------------------------------
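The Detectron2 inference script (`inference/infer_video_d2.py`) stores its per-frame detections in a Detectron1-compatible layout (`[[], bbox_tensor]` and `[[], kps]`, with the keypoint array transposed to shape `(num_people, 4, num_joints)`), together with the source video resolution, in a compressed `.npz` archive named after the input video. Below is a minimal sketch of how such an archive might be inspected; the file path is a hypothetical example, and the per-person joint count (17 COCO keypoints for the default Detectron2 keypoint models) is an assumption rather than something fixed by the script.

```python
import numpy as np

# Minimal sketch (not part of the repository): inspect the archive written by
# inference/infer_video_d2.py. The path is hypothetical; np.savez_compressed
# appends '.npz' to the basename of the processed video.
data = np.load('/tmp/infer_simple/video.mp4.npz', allow_pickle=True)

metadata = data['metadata'].item()   # {'w': frame width, 'h': frame height}
boxes = data['boxes']                # one [[], bbox_array] entry per frame; bbox rows are (x1, y1, x2, y2, score)
keypoints = data['keypoints']        # one [[], kps] entry per frame; kps has shape (num_people, 4, num_joints)

print('Resolution: {}x{}, frames: {}'.format(metadata['w'], metadata['h'], len(keypoints)))

kps = keypoints[0][1]                # index 1 mirrors the Detectron1 class layout used by the script
if len(kps) > 0:
    # For each detected person, the four rows are x, y, a dummy logit, and the keypoint probability.
    xy = kps[0, :2].T                # (num_joints, 2) image coordinates for the first detection
    prob = kps[0, 3]                 # (num_joints,) per-keypoint confidence
    print('People in first frame: {}, joints per person: {}'.format(len(kps), xy.shape[0]))
```

The zero "logit" row only mimics the Detectron1 keypoint layout (as noted in the script's own comment), so the probability row can be read at index 3 regardless of which detector produced the file.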