├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── DATASETS.md ├── DOCUMENTATION.md ├── INFERENCE.md ├── LICENSE ├── README.md ├── common ├── arguments.py ├── camera.py ├── custom_dataset.py ├── generators.py ├── h36m_dataset.py ├── humaneva_dataset.py ├── loss.py ├── mocap_dataset.py ├── model.py ├── quaternion.py ├── skeleton.py ├── utils.py └── visualization.py ├── data ├── ConvertHumanEva.m ├── convert_cdf_to_mat.m ├── data_utils.py ├── prepare_data_2d_custom.py ├── prepare_data_2d_h36m_generic.py ├── prepare_data_2d_h36m_sh.py ├── prepare_data_h36m.py └── prepare_data_humaneva.py ├── images ├── batching.png ├── convolutions_1f_naive.png ├── convolutions_1f_optimized.png ├── convolutions_anim.gif ├── convolutions_causal.png ├── convolutions_normal.png ├── demo_h36m.gif ├── demo_humaneva.gif ├── demo_humaneva_unlabeled.gif ├── demo_temporal.gif └── demo_yt.gif ├── inference ├── infer_video.py └── infer_video_d2.py └── run.py /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Code of Conduct 2 | Facebook has adopted a Code of Conduct that we expect project participants to adhere to. Please [read the full text](https://code.facebook.com/codeofconduct) so that you can understand what actions will and will not be tolerated. -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing 2 | We want to make contributing to this project as easy and transparent as 3 | possible. 4 | 5 | ## Pull Requests 6 | We actively welcome your pull requests. 7 | 8 | 1. Fork the repo and create your branch from `master`. 9 | 2. If you've added code that should be tested, add tests. 10 | 3. If you've changed APIs, update the documentation. 11 | 4. Ensure the test suite passes. 12 | 5. Make sure your code lints. 13 | 6. If you haven't already, complete the Contributor License Agreement ("CLA"). 14 | 15 | ## Contributor License Agreement ("CLA") 16 | In order to accept your pull request, we need you to submit a CLA. You only need 17 | to do this once to work on any of Facebook's open source projects. 18 | 19 | Complete your CLA here: 20 | 21 | ## Issues 22 | We use GitHub issues to track public bugs. Please ensure your description is 23 | clear and has sufficient instructions to be able to reproduce the issue. 24 | 25 | ## Coding Style 26 | We follow the [PEP 8](https://www.python.org/dev/peps/pep-0008/) style guidelines. 27 | 28 | ## License 29 | By contributing to this project, you agree that your contributions will be licensed 30 | under the LICENSE file in the root directory of this source tree. -------------------------------------------------------------------------------- /DATASETS.md: -------------------------------------------------------------------------------- 1 | # Dataset setup 2 | 3 | ## Human3.6M 4 | We provide two ways to set up the Human3.6M dataset on our pipeline. You can either convert the original dataset (recommended) or use the [dataset preprocessed by Martinez et al.](https://github.com/una-dinosauria/3d-pose-baseline) (no longer available as of May 22nd, 2020). The two methods produce the same result. After this step, you should end up with two files in the `data` directory: `data_3d_h36m.npz` for the 3D poses, and `data_2d_h36m_gt.npz` for the ground-truth 2D poses. 5 | 6 | ### Setup from original source (recommended) 7 | **Update:** we have updated the instructions to simplify the procedure. 
MATLAB is no longer required for this step. 8 | 9 | Register to the [Human3.6m website](http://vision.imar.ro/human3.6m/) website (or login if you already have an account) and download the dataset in its original format. You only need to download *Poses -> D3 Positions* for each subject (1, 5, 6, 7, 8, 9, 11) 10 | 11 | ##### Instructions without MATLAB (recommended) 12 | You first need to install `cdflib` Python library via `pip install cdflib`. 13 | 14 | Extract the archives named `Poses_D3_Positions_S*.tgz` (subjects 1, 5, 6, 7, 8, 9, 11) to a common directory. Your directory tree should look like this: 15 | 16 | ``` 17 | /path/to/dataset/S1/MyPoseFeatures/D3_Positions/Directions 1.cdf 18 | /path/to/dataset/S1/MyPoseFeatures/D3_Positions/Directions.cdf 19 | ... 20 | ``` 21 | 22 | Then, run the preprocessing script: 23 | ```sh 24 | cd data 25 | python prepare_data_h36m.py --from-source-cdf /path/to/dataset 26 | cd .. 27 | ``` 28 | 29 | If everything goes well, you are ready to go. 30 | 31 | ##### Instructions with MATLAB (old instructions) 32 | First, we need to convert the 3D poses from `.cdf` to `.mat`, so they can be loaded from Python scripts. To this end, we have provided the MATLAB script `convert_cdf_to_mat.m` in the `data` directory. Extract the archives named `Poses_D3_Positions_S*.tgz` (subjects 1, 5, 6, 7, 8, 9, 11) to a directory named `pose`, and set up your directory tree so that it looks like this: 33 | 34 | ``` 35 | /path/to/dataset/convert_cdf_to_mat.m 36 | /path/to/dataset/pose/S1/MyPoseFeatures/D3_Positions/Directions 1.cdf 37 | /path/to/dataset/pose/S1/MyPoseFeatures/D3_Positions/Directions.cdf 38 | ... 39 | ``` 40 | Then run `convert_cdf_to_mat.m` from MATLAB. 41 | 42 | Finally, run the Python conversion script specifying the dataset path: 43 | ```sh 44 | cd data 45 | python prepare_data_h36m.py --from-source /path/to/dataset/pose 46 | cd .. 47 | ``` 48 | 49 | ### Setup from preprocessed dataset (old instructions) 50 | **Update:** the link to the preprocessed dataset is no longer available; please use the procedure above. These instructions have been kept for backwards compatibility in case you already have a copy of this archive. All procedures produce the same result. 51 | 52 | Download the [~~h36m.zip archive~~](https://www.dropbox.com/s/e35qv3n6zlkouki/h36m.zip) (source: [3D pose baseline repository](https://github.com/una-dinosauria/3d-pose-baseline)) to the `data` directory, and run the conversion script from the same directory. This step does not require any additional dependency. 53 | 54 | ```sh 55 | cd data 56 | wget https://www.dropbox.com/s/e35qv3n6zlkouki/h36m.zip 57 | python prepare_data_h36m.py --from-archive h36m.zip 58 | cd .. 59 | ``` 60 | 61 | ## 2D detections for Human3.6M 62 | We provide support for the following 2D detections: 63 | 64 | - `gt`: ground-truth 2D poses, extracted through the camera projection parameters. 65 | - `sh_pt_mpii`: Stacked Hourglass detections (model pretrained on MPII, no fine tuning). 66 | - `sh_ft_h36m`: Stacked Hourglass detections, fine-tuned on Human3.6M. 67 | - `detectron_pt_h36m`: Detectron (Mask R-CNN) detections (model pretrained on COCO, no fine tuning). 68 | - `detectron_ft_h36m`: Detectron (Mask R-CNN) detections, fine-tuned on Human3.6M. 69 | - `cpn_ft_h36m_dbb`: Cascaded Pyramid Network detections, fine-tuned on Human3.6M. Bounding boxes from `detectron_ft_h36m`. 70 | - User-supplied (see below). 
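All of these detection sources share the same archive format. As a quick sanity check, the snippet below shows one way to inspect such an archive -- a minimal sketch, assuming NumPy is installed and that an archive such as `data_2d_h36m_cpn_ft_h36m_dbb.npz` is already present in `data/`; the `positions_2d`/`metadata` key names follow the format written by the preprocessing scripts in `data/` and should be treated as an assumption here.

```python
# Sketch: inspect a 2D detection archive (key names assumed to match the
# format produced by the preprocessing scripts in data/).
import numpy as np

archive = np.load('data_2d_h36m_cpn_ft_h36m_dbb.npz', allow_pickle=True)
metadata = archive['metadata'].item()        # e.g. number of joints, left/right symmetry
keypoints = archive['positions_2d'].item()   # dict: subject -> action -> one array per camera

print(metadata)
for subject, actions in keypoints.items():
    for action, cameras in actions.items():
        # each camera entry is an array of shape (num_frames, num_joints, 2)
        print(subject, action, [cam.shape for cam in cameras])
```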
71 | 72 | The 2D detection source is specified through the `--keypoints` parameter, which loads the file `data_2d_DATASET_DETECTION.npz` from the `data` directory, where `DATASET` is the dataset name (e.g. `h36m`) and `DETECTION` is the 2D detection source (e.g. `sh_pt_mpii`). Since all the files are encoded according to the same format, it is trivial to create a custom set of 2D detections. 73 | 74 | Ground-truth poses (`gt`) have already been extracted by the previous step. The other detections must be downloaded manually (see instructions below). You only need to download the detections you want to use. For reference, our best results on Human3.6M are achieved by `cpn_ft_h36m_dbb`. 75 | 76 | ### Mask R-CNN and CPN detections 77 | You can download these directly and put them in the `data` directory. We recommend starting with: 78 | 79 | ```sh 80 | cd data 81 | wget https://dl.fbaipublicfiles.com/video-pose-3d/data_2d_h36m_cpn_ft_h36m_dbb.npz 82 | wget https://dl.fbaipublicfiles.com/video-pose-3d/data_2d_h36m_detectron_ft_h36m.npz 83 | cd .. 84 | ``` 85 | 86 | These detections have been produced by models fine-tuned on Human3.6M. We adopted the usual protocol of fine-tuning on 5 subjects (S1, S5, S6, S7, and S8). We also included detections from the unlabeled subjects S2, S3, S4, which can be loaded by our framework for semi-supervised experimentation. 87 | 88 | Optionally, you can download the Mask R-CNN detections without fine-tuning if you want to experiment with these: 89 | ```sh 90 | cd data 91 | wget https://dl.fbaipublicfiles.com/video-pose-3d/data_2d_h36m_detectron_pt_coco.npz 92 | cd .. 93 | ``` 94 | 95 | ### Stacked Hourglass detections 96 | These detections (both pretrained and fine-tuned) are provided by [Martinez et al.](https://github.com/una-dinosauria/3d-pose-baseline) in their repository on 3D human pose estimation. The 2D poses produced by the pretrained model are in the same archive as the dataset ([h36m.zip](https://www.dropbox.com/s/e35qv3n6zlkouki/h36m.zip)). The fine-tuned poses can be downloaded [here](https://drive.google.com/open?id=0BxWzojlLp259S2FuUXJ6aUNxZkE). Put the two archives in the `data` directory and run: 97 | 98 | ```sh 99 | cd data 100 | python prepare_data_2d_h36m_sh.py -pt h36m.zip 101 | python prepare_data_2d_h36m_sh.py -ft stacked_hourglass_fined_tuned_240.tar.gz 102 | cd .. 103 | ``` 104 | 105 | ## HumanEva-I 106 | For HumanEva, you need the original dataset and MATLAB. We provide a MATLAB script to extract the revelant parts of the dataset automatically. 107 | 108 | 1. Download the [HumanEva-I dataset](http://humaneva.is.tue.mpg.de/datasets_human_1) and extract it. 109 | 2. Download the official [source code v1.1 beta](http://humaneva.is.tue.mpg.de/main/download?file=Release_Code_v1_1_beta.zip) and extract it where you extracted the dataset. 110 | 3. Copy the contents of the directory `Release_Code_v1_1_beta\HumanEva_I` to the root of the source tree (`Release_Code_v1_1_beta/`). 111 | 4. Download the [critical dataset update](http://humaneva.is.tue.mpg.de/main/download?file=Critical_Update_OFS_files.zip) and apply it. 112 | 5. **Important:** for visualization purposes, the original code requires an old library named *dxAvi*, which is used for decoding XVID videos. A precompiled binary for 32-bit architectures is already included, but if you are running MATLAB on a 64-bit system, the code will not work. You can either recompile *dxAvi* library for x64, or bypass it entirely, since we are not using visualization features in our conversion script. 
To this end, you can patch `@sync_stream/sync_stream.m`, replacing line 202: `ImageStream(I) = image_stream(image_paths{I}, start_image_offset(I));` with `ImageStream(I) = 0;` 113 | 6. Now you can copy our script `ConvertHumanEva.m` (from `data/`) to `Release_Code_v1_1_beta/`, and run it. It will create a directory named `converted_15j`, which contains the converted 2D/3D ground-truth poses on a 15-joint skeleton. 114 | 7. **Optional:** if you want to experiment with a 20-joint skeleton, change `N_JOINTS` to 20 in `ConvertHumanEva.m`, and repeat the process. It will create a directory named `converted_20j`. Adapt next steps accordingly. 115 | 116 | If you get warnings about mocap errors or dropped frames, this is normal. The HumanEva dataset contains some invalid frames due to occlusions, which are simply discarded. Since we work with videos (and not individual frames), we try to minimize the impact of this issue by grouping valid sequences into contiguous chunks. 117 | 118 | Finally, run the Python script to produce the final files: 119 | ``` 120 | python prepare_data_humaneva.py -p /path/to/dataset/Release_Code_v1_1_beta/converted_15j --convert-3d 121 | ``` 122 | You should end up with two files in the `data` directory: `data_3d_humaneva15.npz` for the 3D poses, and `data_2d_humaneva15_gt.npz` for the ground-truth 2D poses. 123 | 124 | ### 2D detections for HumanEva-I 125 | We provide support for the following 2D detections: 126 | 127 | - `gt`: ground-truth 2D poses, extracted through camera projection. 128 | - `detectron_pt_coco`: Detectron (Mask R-CNN) detections, pretrained on COCO. 129 | 130 | Since HumanEva is very small, we do not fine-tune the pretrained models. As before, you can download Mask R-CNN detections from AWS (`data_2d_humaneva15_detectron_pt_coco.npz`, which must be copied to `data/`). As before, we have included detections for unlabeled subjects/actions. These begin with the prefix `Unlabeled/`. Chunks that correspond to corrupted motion capture streams are also marked as unlabeled. 131 | ```sh 132 | cd data 133 | wget https://dl.fbaipublicfiles.com/video-pose-3d/data_2d_humaneva15_detectron_pt_coco.npz 134 | cd .. 135 | ``` -------------------------------------------------------------------------------- /DOCUMENTATION.md: -------------------------------------------------------------------------------- 1 | # Documentation 2 | This guide explains in depth all the features of this framework. Make sure you have read the quick start guide in [`README.md`](README.md) before proceeding. 3 | 4 | ## Training 5 | By default, the script `run.py` runs in training mode. The list of command-line arguments is defined in `common/arguments.py`. 6 | 7 | - `-h`: shows the help / list of parameters. 8 | - `-d` or `--dataset`: specifies the dataset to use (`h36m` or `humaneva15`). Default: `h36m`. If you converted the 20-joint HumanEva skeleton, you can also use `humaneva20`. 9 | - `-k` or `--keypoints`: specifies the 2D detections to use. Default: `cpn_ft_h36m_dbb` (CPN fine-tuned on Human 3.6M). 10 | - `-c` or `--checkpoint`: specifies the directory where checkpoints are saved/read. Default: `checkpoint`. 11 | - `--checkpoint-frequency`: save checkpoints every N epochs. Default: `10`. 12 | - `-r` or `--resume`: resume training from a particular checkpoint (you should only specify the file name, not the path), e.g. `epoch_10.bin`. 13 | - `-str` or `--subjects-train`: specifies the list of subjects on which the model is trained, separated by commas. Default: `S1,S5,S6,S7,S8`. 
For HumanEva, you may want to specify these manually. 14 | - `-ste` or `--subjects-test`: specifies the list of subjects on which the model is tested at the end of each epoch (and in the final evaluation), separated by comma. Default: `S9,S11`. For HumanEva, you may want to specify these manually. 15 | - `-a` or `--actions`: select only a subset of actions, separated by commas. E.g. `Walk,Jog`. By default, all actions are used. 16 | - `-e` or `--epochs`: train for N epochs, i.e. N passes over the entire training set. Default: `60`. 17 | - `--no-eval`: disable testing at the end of each epoch (marginal speed up). By default, testing is enabled. 18 | - `--export-training-curves`: export training curves as PNG images after every epoch. They are saved in the checkpoint directory. Default: disabled. 19 | 20 | 21 | If `--no-eval` is not specified, the model is tested at the end of each epoch, although the reported metric is merely an approximation of the final result (for performance reasons). Once training is over, the model is automatically tested using the full procedure. This means that you can also specify the testing parameters when training. 22 | 23 | Here is a description of the model hyperparameters: 24 | - `-s` or `--stride`: the chunk size used for training, i.e. the number of frames that are predicted at once from each sequence. Increasing this value improves training speed at the expense of the error (due to correlated batch statistics). Default: `1` frame, which ensures maximum decorrelation. When this value is set to `1`, we also employ an optimized implementation of the model (see implementation details). 25 | - `-b` or `--batch-size`: the batch size used for training the model, in terms of *output frames* (regardless of the stride/chunk length). Default: `1024` frames. 26 | - `-drop` or `--dropout`: dropout probability. Default: `0.25`. 27 | - `-lr` or `--learning-rate`: initial learning rate. Default: `0.001`. 28 | - `-lrd` or `--lr-decay`: learning rate decay after every epoch (multiplicative coefficient). Default: `0.95`. 29 | - `-no-tta` or `--no-test-time-augmentation`: disable test-time augmentation (which is enabled by default), i.e. do not flip poses horizontally when testing the model. Only effective when combined with data augmentation, so if you disable this you should also disable train-time data augmentation. 30 | - `-no-da` or `--no-data-augmentation`: disable train-time data augmentation (which is enabled by default), i.e. do not flip poses horizontally to double the training data. 31 | - `-arc` or `--architecture`: filter widths (only odd numbers supported) separated by commas. This parameter also specifies the number of residual blocks, and determines the receptive field of the model. The first number refers to the input layer, and is followed by the filter widths of the residual blocks. For instance, `3,5,5` uses `3x1` convolutions in the first layer, followed by two residual blocks with `5x1` convolutions. Default: `3,3,3`. Some valid examples are: 32 | -- `-arc 3,3,3` (27 frames) 33 | -- `-arc 3,3,7` (63 frames) 34 | -- `-arc 3,3,3,3` (81 frames) 35 | -- `-arc 3,3,3,3,3` (243 frames) 36 | - `--causal`: use causal (i.e. asymmetric) convolutions instead of symmetric convolutions. Causal convolutions are suitable for real-time applications because they do not exploit future frames (they only look in the past), but symmetric convolutions result in a better error since they can consider both past and future data. See below for more details. Default: disabled. 
37 | - `-ch` or `--channels`: number of channels in convolutions. Default: `1024`. 38 | - `--dense`: use dense convolutions instead of dilated convolutions. This is only useful for benchmarks and ablation experiments. 39 | - `--disable-optimizations`: disable the optimized implementation when `--stride` == `1`. This is only useful for benchmarks. 40 | 41 | ## Semi-supervised training 42 | Semi-supervised learning is only implemented for Human3.6M. 43 | 44 | - `-sun` or `--subjects-unlabeled`: specifies the list of unlabeled subjects that are used for semi-supervision (separated by commas). Semi-supervised learning is automatically enabled when this parameter is set. 45 | - `--warmup`: number of supervised training epochs before attaching the semi-supervised loss. Default: `1` epoch. You may want to increase this when downsampling the dataset. 46 | - `--subset`: reduce the size of the training set by a given factor (a real number). E.g. `0.1` uses one tenth of the training data. Subsampling is achieved by extracting a random contiguous chunk from each video, while preserving the original frame rate. Default: `1` (i.e. disabled). This parameter can also be used in a supervised setting, but it is especially useful to simulate data scarcity in a semi-supervised setting. 47 | - `--downsample`: reduce the dataset frame rate by an integer factor. Default: `1` (i.e. disabled). 48 | - `--no-bone-length`: do not add the bone length term to the unsupervised loss function (only useful for ablation experiments). 49 | - `--linear-projection`: ignore non-linear camera distortion parameters when performing projection to 2D, i.e. use only focal length and principal point. 50 | - `--no-proj`: do not add the projection consistency term to the loss function (only useful for ablations). 51 | 52 | ## Testing 53 | To test a particular model, you need to specify the checkpoint file via the `--evaluate` parameter, which will be loaded from the checkpoint directory (default: `checkpoint/`, but you can change it using the `-c` parameter). You also need to specify the same settings/hyperparameters that you used for training (e.g. input keypoints, architecture, etc.). The script will not run any compatibility checks -- this is a design choice to facilitate ablation experiments. 54 | 55 | ## Visualization 56 | You can render videos by specifying both `--evaluate` and `--render`. The script generates a visualization which contains three viewports: the 2D input keypoints (and optionally, a video overlay), the 3D reconstruction, and the 3D ground truth. 57 | Note that when you specify a video, the 2D detections are still loaded from the dataset according to the given parameters. It is up to you to choose the correct video. You can also visualize unlabeled videos -- in this case, the ground truth will not be shown. 58 | 59 | Here is a list of the command-line arguments related to visualization: 60 | - `--viz-subject`: subject to render, e.g. `S1`. 61 | - `--viz-action`: action to render, e.g. `Walking` or `Walking 1`. 62 | - `--viz-camera`: camera to render (integer), from 0 to 3 for Human3.6M, 0 to 2 for HumanEva. Default: `0`. 63 | - `--viz-video`: path to the 2D video to show. If specified, the script will render a skeleton overlay on top of the video. If not specified, a black background will be rendered instead (but the 2D detections will still be shown). 64 | - `--viz-skip`: skip the first N frames from the specified video. Useful for HumanEva. Default: `0`. 
65 | - `--viz-output`: output file name (either a `.mp4` or `.gif` file). 66 | - `--viz-bitrate`: bitrate for MP4 videos. Default: `3000`. 67 | - `--viz-no-ground-truth`: by default, the videos contain three viewports: the 2D input pose, the 3D reconstruction, and the 3D ground truth. This flags removes the last one. 68 | - `--viz-limit`: render only first N frames. By default, all frames are rendered. 69 | - `--viz-downsample`: downsample videos by the specified factor, i.e. reduce the frame rate. E.g. if set to `2`, the frame rate is reduced from 50 FPS to 25 FPS. Default: `1` (no downsampling). 70 | - `--viz-size`: output resolution multiplier. Higher = larger images. Default: `5`. 71 | - `--viz-export`: export 3D joint coordinates (in camera space) to the specified NumPy archive. 72 | 73 | Example: 74 | ``` 75 | python run.py -k cpn_ft_h36m_dbb -arc 3,3,3,3,3 -c checkpoint --evaluate pretrained_h36m_cpn.bin --render --viz-subject S11 --viz-action Walking --viz-camera 0 --viz-video "/path/to/videos/S11/Videos/Walking.54138969.mp4" --viz-output output.gif --viz-size 3 --viz-downsample 2 --viz-limit 60 76 | ``` 77 | ![](images/demo_h36m.gif) 78 | 79 | Generates a visualization for S11/Walking from camera 0, and exports the first frames to a GIF animation with a frame rate of 25 FPS. If you remove the `--viz-video` parameter, the skeleton overlay will be rendered on a blank background. 80 | 81 | While Human3.6M visualization works out of the box, HumanEva visualization is trickier because the original videos must be segmented manually. Additionally, invalid frames and software synchronization complicate matters. Nonetheless, you can get decent visualizations by selecting the chunk 0 of validation sequences (which start at the beginning of each video) and discarding the first frames using `--viz-skip`. For a suggestion on the number of frames to skip, take a look at `sync_data` in `data/prepare_data_humaneva.py`. 82 | 83 | Example: 84 | ``` 85 | python run.py -d humaneva15 -k detectron_pt_coco -str Train/S1,Train/S2,Train/S3 -ste Validate/S1,Validate/S2,Validate/S3 -c checkpoint --evaluate pretrained_humaneva15_detectron.bin --render --viz-subject Validate/S2 --viz-action "Walking 1 chunk0" --viz-camera 0 --viz-output output_he.gif --viz-size 3 --viz-downsample 2 --viz-video "/path/to/videos/S2/Walking_1_(C1).avi" --viz-skip 115 --viz-limit 60 86 | ``` 87 | ![](images/demo_humaneva.gif) 88 | 89 | Unlabeled videos are easier to visualize because they do not require synchronization with the ground truth. In this case, visualization works out of the box even for HumanEva. 90 | 91 | Example: 92 | ``` 93 | python run.py -d humaneva15 -k detectron_pt_coco -str Train/S1,Train/S2,Train/S3 -ste Validate/S1,Validate/S2,Validate/S3 -c checkpoint --evaluate pretrained_humaneva15_detectron.bin --render --viz-subject Unlabeled/S4 --viz-action "Box 2" --viz-camera 0 --viz-output output_he.gif --viz-size 3 --viz-downsample 2 --viz-video "/path/to/videos/S4/Box_2_(C1).avi" --viz-limit 60 94 | ``` 95 | ![](images/demo_humaneva_unlabeled.gif) 96 | 97 | ## Implementation details 98 | ### Batch generation during training 99 | Some details of our training procedure are better understood visually. 100 | ![](images/batching.png) 101 | The figure above shows how training batches are generated, depending on the value of `--stride` (from left to right: 1, 2, and 4). This example shows a sequence of 2D poses which has a length of N = 8 frames. 
The 3D poses (blue boxes in the figure) are inferred using a model that has a receptive field F = 5 frames. Therefore, because of valid padding, an input sequence of length N results in an output sequence of length N - F + 1, i.e. N - 4 in this example. 102 | 103 | When `--stride=1`, we generate one training example for each frame. This ensures that the batches are maximally uncorrelated, which helps batch normalization as well as generalization. As `--stride` increases, training becomes faster because the model can reutilize intermediate computations, at the cost of biased batch statistics. However, we provide an optimized implementation when `--stride=1`, which replaces dilated convolutions with strided convolutions (only while training), so in principle you should not touch this parameter unless you want to run specific experiments. To understand how it works, see the figures below: 104 | 105 | ![](images/convolutions_1f_naive.png) 106 | The figure above shows the information flow for a model with a receptive field of 27 frames, and a single-frame prediction, i.e. from N = 27 input frames we end up with one output frame. You can observe that this regular implementation tends to waste some intermediate results when a small number of frames are predicted. However, for inference of very long sequences, this approach is very efficient as intermediate results are shared among successive frames. 107 | 108 | ![](images/convolutions_1f_optimized.png) 109 | Therefore, for training *only*, we use the implementation above, which replaces dilated convolutions with strided convolutions. It achieves the same result, but avoids computing unnecessary intermediate results. 110 | 111 | ### Symmetric convolutions vs causal convolutions 112 | The figures below show the information flow from input (bottom) to output (top). In this example, we adopt a model with a receptive field of 27 frames. 113 | 114 | ![](images/convolutions_normal.png) 115 | With symmetric convolutions, both past and future information is exploited, resulting in a better reconstruction. 116 | 117 | ![](images/convolutions_causal.png) 118 | With causal convolutions, only past data is exploited. This approach is suited to real-time applications where future data cannot be exploited, at the cost of a slightly higher error. -------------------------------------------------------------------------------- /INFERENCE.md: -------------------------------------------------------------------------------- 1 | # Inference in the wild 2 | 3 | **Update:** we have added support for Detectron2. 4 | 5 | In this short tutorial, we show how to run our model on arbitrary videos and visualize the predictions. Note that this feature is only provided for experimentation/research purposes and presents some limitations, as this repository is meant to provide a reference implementation of the approach described in the paper (not production-ready code for inference in the wild). 6 | 7 | Our script assumes that a video depicts *exactly* one person. In case of multiple people visible at once, the script will select the person corresponding to the bounding box with the highest confidence, which may cause glitches. 8 | 9 | The instructions below show how to use Detectron to infer 2D keypoints from videos, convert them to a custom dataset for our code, and infer 3D poses. For now, we do not have instructions for CPN. In the last section of this tutorial, we also provide some tips. 
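For intuition, here is a minimal sketch of the highest-confidence selection mentioned above (the function name and the bounding-box layout, with the confidence score in the last column, are assumptions made for illustration; the actual per-frame logic used when building the custom dataset is part of `data/prepare_data_2d_custom.py`, see Step 4).

```python
import numpy as np

def select_best_person(boxes, keypoints):
    # boxes:     (num_people, 5) array -- x1, y1, x2, y2, score (assumed layout)
    # keypoints: (num_people, num_joints, 2) array of 2D joint coordinates
    if len(boxes) == 0:
        # Missed detection: the keypoints for this frame are later
        # interpolated from neighboring frames (see Step 4).
        return None
    best = int(np.argmax(boxes[:, 4]))  # index of the highest-confidence box
    return keypoints[best]
```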
10 | 11 | ## Step 1: setup 12 | The inference script requires `ffmpeg`, which you can easily install via conda, pip, or manually. 13 | 14 | Download the [pretrained model](https://dl.fbaipublicfiles.com/video-pose-3d/pretrained_h36m_detectron_coco.bin) for generating 3D predictions. This model is different than the pretrained ones listed in the main README, as it expects input keypoints in COCO format (generated by the pretrained Detectron model) and outputs 3D joint positions in Human3.6M format. Put this model in the `checkpoint` directory of this repo. 15 | 16 | **Note:** if you had downloaded `d-pt-243.bin`, you should download the new pretrained model using the link above. `d-pt-243.bin` takes the keypoint probabilities as input (in addition to the x, y coordinates), which causes problems on videos with a different resolution than that of Human3.6M. The new model is only trained on 2D coordinates and works with any resolution/aspect ratio. 17 | 18 | ## Step 2 (optional): video preprocessing 19 | Since the script expects a single-person scenario, you may want to extract a portion of your video. This is very easy to do with ffmpeg, e.g. 20 | ``` 21 | ffmpeg -i input.mp4 -ss 1:00 -to 1:30 -c copy output.mp4 22 | ``` 23 | extracts a clip from minute 1:00 to minute 1:30 of `input.mp4`, and exports it to `output.mp4`. 24 | 25 | Optionally, you can also adapt the frame rate of the video. Most videos have a frame rate of about 25 FPS, but our Human3.6M model was trained on 50-FPS videos. Since our model is robust to alterations in speed, this step is not very important and can be skipped, but if you want the best possible results you can use ffmpeg again for this task: 26 | ``` 27 | ffmpeg -i input.mp4 -filter "minterpolate='fps=50'" -crf 0 output.mp4 28 | ``` 29 | 30 | ## Step 3: inferring 2D keypoints with Detectron 31 | 32 | ### Using Detectron2 (new) 33 | Set up [Detectron2](https://github.com/facebookresearch/detectron2) and use the script `inference/infer_video_d2.py` (no need to copy this, as it directly uses the Detectron2 API). This script provides a convenient interface to generate 2D keypoint predictions from videos without manually extracting individual frames. 34 | 35 | To infer keypoints from all the mp4 videos in `input_directory`, run 36 | ``` 37 | cd inference 38 | python infer_video_d2.py \ 39 | --cfg COCO-Keypoints/keypoint_rcnn_R_101_FPN_3x.yaml \ 40 | --output-dir output_directory \ 41 | --image-ext mp4 \ 42 | input_directory 43 | ``` 44 | The results will be exported to `output_directory` as custom NumPy archives (`.npz` files). You can change the video extension in `--image-ext` (ffmpeg supports a wide range of formats). 45 | 46 | **Note:** although the architecture is the same (ResNet-101), the weights used by the Detectron2 model are not the same as those used by Detectron1. Since our pretrained model was trained on Detectron1 poses, the result might be slightly different (but it should still be pretty close). 47 | 48 | ### Using Detectron1 (old instructions) 49 | Set up [Detectron](https://github.com/facebookresearch/Detectron) and copy the script `inference/infer_video.py` from this repo to the `tools` directory of the Detectron repo. This script provides a convenient interface to generate 2D keypoint predictions from videos without manually extracting individual frames. 50 | 51 | Our Detectron script `infer_video.py` is a simple adaptation of `infer_simple.py` (which works on images) and has a similar command-line syntax. 
52 | 53 | To infer keypoints from all the mp4 videos in `input_directory`, run 54 | ``` 55 | python tools/infer_video.py \ 56 | --cfg configs/12_2017_baselines/e2e_keypoint_rcnn_R-101-FPN_s1x.yaml \ 57 | --output-dir output_directory \ 58 | --image-ext mp4 \ 59 | --wts https://dl.fbaipublicfiles.com/detectron/37698009/12_2017_baselines/e2e_keypoint_rcnn_R-101-FPN_s1x.yaml.08_45_57.YkrJgP6O/output/train/keypoints_coco_2014_train:keypoints_coco_2014_valminusminival/generalized_rcnn/model_final.pkl \ 60 | input_directory 61 | ``` 62 | The results will be exported to `output_directory` as custom NumPy archives (`.npz` files). You can change the video extension in `--image-ext` (ffmpeg supports a wide range of formats). 63 | 64 | ## Step 4: creating a custom dataset 65 | Run our dataset preprocessing script from the `data` directory: 66 | ``` 67 | python prepare_data_2d_custom.py -i /path/to/detections/output_directory -o myvideos 68 | ``` 69 | This creates a custom dataset named `myvideos` (which contains all the videos in `output_directory`, each of which is mapped to a different subject) and saved to `data_2d_custom_myvideos.npz`. You are free to specify any name for the dataset. 70 | 71 | **Note:** as mentioned, the script will take the bounding box with the highest probability for each frame. If a particular frame has no bounding boxes, it is assumed to be a missed detection and the keypoints will be interpolated from neighboring frames. 72 | 73 | ## Step 5: rendering a custom video and exporting coordinates 74 | You can finally use the visualization feature to render a video of the 3D joint predictions. You must specify the `custom` dataset (`-d custom`), the input keypoints as exported in the previous step (`-k myvideos`), the correct architecture/checkpoint, and the action `custom` (`--viz-action custom`). The subject is the file name of the input video, and the camera is always 0. 75 | ``` 76 | python run.py -d custom -k myvideos -arc 3,3,3,3,3 -c checkpoint --evaluate pretrained_h36m_detectron_coco.bin --render --viz-subject input_video.mp4 --viz-action custom --viz-camera 0 --viz-video /path/to/input_video.mp4 --viz-output output.mp4 --viz-size 6 77 | ``` 78 | 79 | You can also export the 3D joint positions (in camera space) to a NumPy archive. To this end, replace `--viz-output` with `--viz-export` and specify the file name. 80 | 81 | ## Limitations and tips 82 | - The model was trained on Human3.6M cameras (which are relatively undistorted), and the results may be bad if the intrinsic parameters of the cameras of your videos differ much from those of Human3.6M. This may be particularly noticeable with fisheye cameras, which present a high degree of non-linear lens distortion. If the camera parameters are known, consider preprocessing your videos to match those of Human3.6M as closely as possible. 83 | - If you want multi-person tracking, you should implement a bounding box matching strategy. An example would be to use bipartite matching on the bounding box overlap (IoU) between subsequent frames, but there are many other approaches. 84 | - Predictions are relative to the root joint, i.e. the global trajectory is not regressed. If you need it, you may want to use another model to regress it, such as the one we use for semi-supervision. 85 | - Predictions are always in *camera space* (regardless of whether the trajectory is available). 
For our visualization script, we simply take a random camera from Human3.6M, which fits decently most videos where the camera viewport is parallel to the ground. -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Attribution-NonCommercial 4.0 International 2 | 3 | ======================================================================= 4 | 5 | Creative Commons Corporation ("Creative Commons") is not a law firm and 6 | does not provide legal services or legal advice. Distribution of 7 | Creative Commons public licenses does not create a lawyer-client or 8 | other relationship. Creative Commons makes its licenses and related 9 | information available on an "as-is" basis. Creative Commons gives no 10 | warranties regarding its licenses, any material licensed under their 11 | terms and conditions, or any related information. Creative Commons 12 | disclaims all liability for damages resulting from their use to the 13 | fullest extent possible. 14 | 15 | Using Creative Commons Public Licenses 16 | 17 | Creative Commons public licenses provide a standard set of terms and 18 | conditions that creators and other rights holders may use to share 19 | original works of authorship and other material subject to copyright 20 | and certain other rights specified in the public license below. The 21 | following considerations are for informational purposes only, are not 22 | exhaustive, and do not form part of our licenses. 23 | 24 | Considerations for licensors: Our public licenses are 25 | intended for use by those authorized to give the public 26 | permission to use material in ways otherwise restricted by 27 | copyright and certain other rights. Our licenses are 28 | irrevocable. Licensors should read and understand the terms 29 | and conditions of the license they choose before applying it. 30 | Licensors should also secure all rights necessary before 31 | applying our licenses so that the public can reuse the 32 | material as expected. Licensors should clearly mark any 33 | material not subject to the license. This includes other CC- 34 | licensed material, or material used under an exception or 35 | limitation to copyright. More considerations for licensors: 36 | wiki.creativecommons.org/Considerations_for_licensors 37 | 38 | Considerations for the public: By using one of our public 39 | licenses, a licensor grants the public permission to use the 40 | licensed material under specified terms and conditions. If 41 | the licensor's permission is not necessary for any reason--for 42 | example, because of any applicable exception or limitation to 43 | copyright--then that use is not regulated by the license. Our 44 | licenses grant only permissions under copyright and certain 45 | other rights that a licensor has authority to grant. Use of 46 | the licensed material may still be restricted for other 47 | reasons, including because others have copyright or other 48 | rights in the material. A licensor may make special requests, 49 | such as asking that all changes be marked or described. 50 | Although not required by our licenses, you are encouraged to 51 | respect those requests where reasonable. 
More_considerations 52 | for the public: 53 | wiki.creativecommons.org/Considerations_for_licensees 54 | 55 | ======================================================================= 56 | 57 | Creative Commons Attribution-NonCommercial 4.0 International Public 58 | License 59 | 60 | By exercising the Licensed Rights (defined below), You accept and agree 61 | to be bound by the terms and conditions of this Creative Commons 62 | Attribution-NonCommercial 4.0 International Public License ("Public 63 | License"). To the extent this Public License may be interpreted as a 64 | contract, You are granted the Licensed Rights in consideration of Your 65 | acceptance of these terms and conditions, and the Licensor grants You 66 | such rights in consideration of benefits the Licensor receives from 67 | making the Licensed Material available under these terms and 68 | conditions. 69 | 70 | Section 1 -- Definitions. 71 | 72 | a. Adapted Material means material subject to Copyright and Similar 73 | Rights that is derived from or based upon the Licensed Material 74 | and in which the Licensed Material is translated, altered, 75 | arranged, transformed, or otherwise modified in a manner requiring 76 | permission under the Copyright and Similar Rights held by the 77 | Licensor. For purposes of this Public License, where the Licensed 78 | Material is a musical work, performance, or sound recording, 79 | Adapted Material is always produced where the Licensed Material is 80 | synched in timed relation with a moving image. 81 | 82 | b. Adapter's License means the license You apply to Your Copyright 83 | and Similar Rights in Your contributions to Adapted Material in 84 | accordance with the terms and conditions of this Public License. 85 | 86 | c. Copyright and Similar Rights means copyright and/or similar rights 87 | closely related to copyright including, without limitation, 88 | performance, broadcast, sound recording, and Sui Generis Database 89 | Rights, without regard to how the rights are labeled or 90 | categorized. For purposes of this Public License, the rights 91 | specified in Section 2(b)(1)-(2) are not Copyright and Similar 92 | Rights. 93 | d. Effective Technological Measures means those measures that, in the 94 | absence of proper authority, may not be circumvented under laws 95 | fulfilling obligations under Article 11 of the WIPO Copyright 96 | Treaty adopted on December 20, 1996, and/or similar international 97 | agreements. 98 | 99 | e. Exceptions and Limitations means fair use, fair dealing, and/or 100 | any other exception or limitation to Copyright and Similar Rights 101 | that applies to Your use of the Licensed Material. 102 | 103 | f. Licensed Material means the artistic or literary work, database, 104 | or other material to which the Licensor applied this Public 105 | License. 106 | 107 | g. Licensed Rights means the rights granted to You subject to the 108 | terms and conditions of this Public License, which are limited to 109 | all Copyright and Similar Rights that apply to Your use of the 110 | Licensed Material and that the Licensor has authority to license. 111 | 112 | h. Licensor means the individual(s) or entity(ies) granting rights 113 | under this Public License. 114 | 115 | i. NonCommercial means not primarily intended for or directed towards 116 | commercial advantage or monetary compensation. 
For purposes of 117 | this Public License, the exchange of the Licensed Material for 118 | other material subject to Copyright and Similar Rights by digital 119 | file-sharing or similar means is NonCommercial provided there is 120 | no payment of monetary compensation in connection with the 121 | exchange. 122 | 123 | j. Share means to provide material to the public by any means or 124 | process that requires permission under the Licensed Rights, such 125 | as reproduction, public display, public performance, distribution, 126 | dissemination, communication, or importation, and to make material 127 | available to the public including in ways that members of the 128 | public may access the material from a place and at a time 129 | individually chosen by them. 130 | 131 | k. Sui Generis Database Rights means rights other than copyright 132 | resulting from Directive 96/9/EC of the European Parliament and of 133 | the Council of 11 March 1996 on the legal protection of databases, 134 | as amended and/or succeeded, as well as other essentially 135 | equivalent rights anywhere in the world. 136 | 137 | l. You means the individual or entity exercising the Licensed Rights 138 | under this Public License. Your has a corresponding meaning. 139 | 140 | Section 2 -- Scope. 141 | 142 | a. License grant. 143 | 144 | 1. Subject to the terms and conditions of this Public License, 145 | the Licensor hereby grants You a worldwide, royalty-free, 146 | non-sublicensable, non-exclusive, irrevocable license to 147 | exercise the Licensed Rights in the Licensed Material to: 148 | 149 | a. reproduce and Share the Licensed Material, in whole or 150 | in part, for NonCommercial purposes only; and 151 | 152 | b. produce, reproduce, and Share Adapted Material for 153 | NonCommercial purposes only. 154 | 155 | 2. Exceptions and Limitations. For the avoidance of doubt, where 156 | Exceptions and Limitations apply to Your use, this Public 157 | License does not apply, and You do not need to comply with 158 | its terms and conditions. 159 | 160 | 3. Term. The term of this Public License is specified in Section 161 | 6(a). 162 | 163 | 4. Media and formats; technical modifications allowed. The 164 | Licensor authorizes You to exercise the Licensed Rights in 165 | all media and formats whether now known or hereafter created, 166 | and to make technical modifications necessary to do so. The 167 | Licensor waives and/or agrees not to assert any right or 168 | authority to forbid You from making technical modifications 169 | necessary to exercise the Licensed Rights, including 170 | technical modifications necessary to circumvent Effective 171 | Technological Measures. For purposes of this Public License, 172 | simply making modifications authorized by this Section 2(a) 173 | (4) never produces Adapted Material. 174 | 175 | 5. Downstream recipients. 176 | 177 | a. Offer from the Licensor -- Licensed Material. Every 178 | recipient of the Licensed Material automatically 179 | receives an offer from the Licensor to exercise the 180 | Licensed Rights under the terms and conditions of this 181 | Public License. 182 | 183 | b. No downstream restrictions. You may not offer or impose 184 | any additional or different terms or conditions on, or 185 | apply any Effective Technological Measures to, the 186 | Licensed Material if doing so restricts exercise of the 187 | Licensed Rights by any recipient of the Licensed 188 | Material. 189 | 190 | 6. No endorsement. 
Nothing in this Public License constitutes or 191 | may be construed as permission to assert or imply that You 192 | are, or that Your use of the Licensed Material is, connected 193 | with, or sponsored, endorsed, or granted official status by, 194 | the Licensor or others designated to receive attribution as 195 | provided in Section 3(a)(1)(A)(i). 196 | 197 | b. Other rights. 198 | 199 | 1. Moral rights, such as the right of integrity, are not 200 | licensed under this Public License, nor are publicity, 201 | privacy, and/or other similar personality rights; however, to 202 | the extent possible, the Licensor waives and/or agrees not to 203 | assert any such rights held by the Licensor to the limited 204 | extent necessary to allow You to exercise the Licensed 205 | Rights, but not otherwise. 206 | 207 | 2. Patent and trademark rights are not licensed under this 208 | Public License. 209 | 210 | 3. To the extent possible, the Licensor waives any right to 211 | collect royalties from You for the exercise of the Licensed 212 | Rights, whether directly or through a collecting society 213 | under any voluntary or waivable statutory or compulsory 214 | licensing scheme. In all other cases the Licensor expressly 215 | reserves any right to collect such royalties, including when 216 | the Licensed Material is used other than for NonCommercial 217 | purposes. 218 | 219 | Section 3 -- License Conditions. 220 | 221 | Your exercise of the Licensed Rights is expressly made subject to the 222 | following conditions. 223 | 224 | a. Attribution. 225 | 226 | 1. If You Share the Licensed Material (including in modified 227 | form), You must: 228 | 229 | a. retain the following if it is supplied by the Licensor 230 | with the Licensed Material: 231 | 232 | i. identification of the creator(s) of the Licensed 233 | Material and any others designated to receive 234 | attribution, in any reasonable manner requested by 235 | the Licensor (including by pseudonym if 236 | designated); 237 | 238 | ii. a copyright notice; 239 | 240 | iii. a notice that refers to this Public License; 241 | 242 | iv. a notice that refers to the disclaimer of 243 | warranties; 244 | 245 | v. a URI or hyperlink to the Licensed Material to the 246 | extent reasonably practicable; 247 | 248 | b. indicate if You modified the Licensed Material and 249 | retain an indication of any previous modifications; and 250 | 251 | c. indicate the Licensed Material is licensed under this 252 | Public License, and include the text of, or the URI or 253 | hyperlink to, this Public License. 254 | 255 | 2. You may satisfy the conditions in Section 3(a)(1) in any 256 | reasonable manner based on the medium, means, and context in 257 | which You Share the Licensed Material. For example, it may be 258 | reasonable to satisfy the conditions by providing a URI or 259 | hyperlink to a resource that includes the required 260 | information. 261 | 262 | 3. If requested by the Licensor, You must remove any of the 263 | information required by Section 3(a)(1)(A) to the extent 264 | reasonably practicable. 265 | 266 | 4. If You Share Adapted Material You produce, the Adapter's 267 | License You apply must not prevent recipients of the Adapted 268 | Material from complying with this Public License. 269 | 270 | Section 4 -- Sui Generis Database Rights. 271 | 272 | Where the Licensed Rights include Sui Generis Database Rights that 273 | apply to Your use of the Licensed Material: 274 | 275 | a. 
for the avoidance of doubt, Section 2(a)(1) grants You the right 276 | to extract, reuse, reproduce, and Share all or a substantial 277 | portion of the contents of the database for NonCommercial purposes 278 | only; 279 | 280 | b. if You include all or a substantial portion of the database 281 | contents in a database in which You have Sui Generis Database 282 | Rights, then the database in which You have Sui Generis Database 283 | Rights (but not its individual contents) is Adapted Material; and 284 | 285 | c. You must comply with the conditions in Section 3(a) if You Share 286 | all or a substantial portion of the contents of the database. 287 | 288 | For the avoidance of doubt, this Section 4 supplements and does not 289 | replace Your obligations under this Public License where the Licensed 290 | Rights include other Copyright and Similar Rights. 291 | 292 | Section 5 -- Disclaimer of Warranties and Limitation of Liability. 293 | 294 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE 295 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS 296 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF 297 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, 298 | IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION, 299 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR 300 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, 301 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT 302 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT 303 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. 304 | 305 | b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE 306 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, 307 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, 308 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, 309 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR 310 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN 311 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR 312 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR 313 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. 314 | 315 | c. The disclaimer of warranties and limitation of liability provided 316 | above shall be interpreted in a manner that, to the extent 317 | possible, most closely approximates an absolute disclaimer and 318 | waiver of all liability. 319 | 320 | Section 6 -- Term and Termination. 321 | 322 | a. This Public License applies for the term of the Copyright and 323 | Similar Rights licensed here. However, if You fail to comply with 324 | this Public License, then Your rights under this Public License 325 | terminate automatically. 326 | 327 | b. Where Your right to use the Licensed Material has terminated under 328 | Section 6(a), it reinstates: 329 | 330 | 1. automatically as of the date the violation is cured, provided 331 | it is cured within 30 days of Your discovery of the 332 | violation; or 333 | 334 | 2. upon express reinstatement by the Licensor. 335 | 336 | For the avoidance of doubt, this Section 6(b) does not affect any 337 | right the Licensor may have to seek remedies for Your violations 338 | of this Public License. 339 | 340 | c. 
For the avoidance of doubt, the Licensor may also offer the 341 | Licensed Material under separate terms or conditions or stop 342 | distributing the Licensed Material at any time; however, doing so 343 | will not terminate this Public License. 344 | 345 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public 346 | License. 347 | 348 | Section 7 -- Other Terms and Conditions. 349 | 350 | a. The Licensor shall not be bound by any additional or different 351 | terms or conditions communicated by You unless expressly agreed. 352 | 353 | b. Any arrangements, understandings, or agreements regarding the 354 | Licensed Material not stated herein are separate from and 355 | independent of the terms and conditions of this Public License. 356 | 357 | Section 8 -- Interpretation. 358 | 359 | a. For the avoidance of doubt, this Public License does not, and 360 | shall not be interpreted to, reduce, limit, restrict, or impose 361 | conditions on any use of the Licensed Material that could lawfully 362 | be made without permission under this Public License. 363 | 364 | b. To the extent possible, if any provision of this Public License is 365 | deemed unenforceable, it shall be automatically reformed to the 366 | minimum extent necessary to make it enforceable. If the provision 367 | cannot be reformed, it shall be severed from this Public License 368 | without affecting the enforceability of the remaining terms and 369 | conditions. 370 | 371 | c. No term or condition of this Public License will be waived and no 372 | failure to comply consented to unless expressly agreed to by the 373 | Licensor. 374 | 375 | d. Nothing in this Public License constitutes or may be interpreted 376 | as a limitation upon, or waiver of, any privileges and immunities 377 | that apply to the Licensor or You, including from the legal 378 | processes of any jurisdiction or authority. 379 | 380 | ======================================================================= 381 | 382 | Creative Commons is not a party to its public 383 | licenses. Notwithstanding, Creative Commons may elect to apply one of 384 | its public licenses to material it publishes and in those instances 385 | will be considered the “Licensor.” The text of the Creative Commons 386 | public licenses is dedicated to the public domain under the CC0 Public 387 | Domain Dedication. Except for the limited purpose of indicating that 388 | material is shared under a Creative Commons public license or as 389 | otherwise permitted by the Creative Commons policies published at 390 | creativecommons.org/policies, Creative Commons does not authorize the 391 | use of the trademark "Creative Commons" or any other trademark or logo 392 | of Creative Commons without its prior written consent including, 393 | without limitation, in connection with any unauthorized modifications 394 | to any of its public licenses or any other arrangements, 395 | understandings, or agreements concerning use of licensed material. For 396 | the avoidance of doubt, this paragraph does not form part of the 397 | public licenses. 398 | 399 | Creative Commons may be contacted at creativecommons.org. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 3D human pose estimation in video with temporal convolutions and semi-supervised training 2 |

3 | 4 | This is the implementation of the approach described in the paper: 5 | > Dario Pavllo, Christoph Feichtenhofer, David Grangier, and Michael Auli. [3D human pose estimation in video with temporal convolutions and semi-supervised training](https://arxiv.org/abs/1811.11742). In Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 6 | 7 | More demos are available at https://dariopavllo.github.io/VideoPose3D 8 | 9 |

10 | 11 | ![](images/demo_temporal.gif) 12 | 13 | ### Results on Human3.6M 14 | Under Protocol 1 (mean per-joint position error) and Protocol 2 (mean-per-joint position error after rigid alignment). 15 | 16 | | 2D Detections | BBoxes | Blocks | Receptive Field | Error (P1) | Error (P2) | 17 | |:-------|:-------:|:-------:|:-------:|:-------:|:-------:| 18 | | CPN | Mask R-CNN | 4 | 243 frames | **46.8 mm** | **36.5 mm** | 19 | | CPN | Ground truth | 4 | 243 frames | 47.1 mm | 36.8 mm | 20 | | CPN | Ground truth | 3 | 81 frames | 47.7 mm | 37.2 mm | 21 | | CPN | Ground truth | 2 | 27 frames | 48.8 mm | 38.0 mm | 22 | | Mask R-CNN | Mask R-CNN | 4 | 243 frames | 51.6 mm | 40.3 mm | 23 | | Ground truth | -- | 4 | 243 frames | 37.2 mm | 27.2 mm | 24 | 25 | ## Quick start 26 | To get started as quickly as possible, follow the instructions in this section. This should allow you train a model from scratch, test our pretrained models, and produce basic visualizations. For more detailed instructions, please refer to [`DOCUMENTATION.md`](DOCUMENTATION.md). 27 | 28 | ### Dependencies 29 | Make sure you have the following dependencies installed before proceeding: 30 | - Python 3+ distribution 31 | - PyTorch >= 0.4.0 32 | 33 | Optional: 34 | - Matplotlib, if you want to visualize predictions. Additionally, you need *ffmpeg* to export MP4 videos, and *imagemagick* to export GIFs. 35 | - MATLAB, if you want to experiment with HumanEva-I (you need this to convert the dataset). 36 | 37 | ### Dataset setup 38 | You can find the instructions for setting up the Human3.6M and HumanEva-I datasets in [`DATASETS.md`](DATASETS.md). For this short guide, we focus on Human3.6M. You are not required to setup HumanEva, unless you want to experiment with it. 39 | 40 | In order to proceed, you must also copy CPN detections (for Human3.6M) and/or Mask R-CNN detections (for HumanEva). 41 | 42 | ### Evaluating our pretrained models 43 | The pretrained models can be downloaded from AWS. Put `pretrained_h36m_cpn.bin` (for Human3.6M) and/or `pretrained_humaneva15_detectron.bin` (for HumanEva) in the `checkpoint/` directory (create it if it does not exist). 44 | ```sh 45 | mkdir checkpoint 46 | cd checkpoint 47 | wget https://dl.fbaipublicfiles.com/video-pose-3d/pretrained_h36m_cpn.bin 48 | wget https://dl.fbaipublicfiles.com/video-pose-3d/pretrained_humaneva15_detectron.bin 49 | cd .. 50 | ``` 51 | 52 | These models allow you to reproduce our top-performing baselines, which are: 53 | - 46.8 mm for Human3.6M, using fine-tuned CPN detections, bounding boxes from Mask R-CNN, and an architecture with a receptive field of 243 frames. 54 | - 33.0 mm for HumanEva-I (on 3 actions), using pretrained Mask R-CNN detections, and an architecture with a receptive field of 27 frames. This is the multi-action model trained on 3 actions (Walk, Jog, Box). 55 | 56 | To test on Human3.6M, run: 57 | ``` 58 | python run.py -k cpn_ft_h36m_dbb -arc 3,3,3,3,3 -c checkpoint --evaluate pretrained_h36m_cpn.bin 59 | ``` 60 | 61 | To test on HumanEva, run: 62 | ``` 63 | python run.py -d humaneva15 -k detectron_pt_coco -str Train/S1,Train/S2,Train/S3 -ste Validate/S1,Validate/S2,Validate/S3 -a Walk,Jog,Box --by-subject -c checkpoint --evaluate pretrained_humaneva15_detectron.bin 64 | ``` 65 | 66 | [`DOCUMENTATION.md`](DOCUMENTATION.md) provides a precise description of all command-line arguments. 67 | 68 | ### Inference in the wild 69 | We have introduced an experimental feature to run our model on custom videos. 
See [`INFERENCE.md`](INFERENCE.md) for more details. 70 | 71 | ### Training from scratch 72 | If you want to reproduce the results of our pretrained models, run the following commands. 73 | 74 | For Human3.6M: 75 | ``` 76 | python run.py -e 80 -k cpn_ft_h36m_dbb -arc 3,3,3,3,3 77 | ``` 78 | By default the application runs in training mode. This will train a new model for 80 epochs, using fine-tuned CPN detections. Expect a training time of 24 hours on a high-end Pascal GPU. If you feel that this is too much, or your GPU is not powerful enough, you can train a model with a smaller receptive field, e.g. 79 | - `-arc 3,3,3,3` (81 frames) should require 11 hours and achieve 47.7 mm. 80 | - `-arc 3,3,3` (27 frames) should require 6 hours and achieve 48.8 mm. 81 | 82 | You could also lower the number of epochs from 80 to 60 with a negligible impact on the result. 83 | 84 | For HumanEva: 85 | ``` 86 | python run.py -d humaneva15 -k detectron_pt_coco -str Train/S1,Train/S2,Train/S3 -ste Validate/S1,Validate/S2,Validate/S3 -b 128 -e 1000 -lrd 0.996 -a Walk,Jog,Box --by-subject 87 | ``` 88 | This will train for 1000 epochs, using Mask R-CNN detections and evaluating each subject separately. 89 | Since HumanEva is much smaller than Human3.6M, training should require about 50 minutes. 90 | 91 | ### Semi-supervised training 92 | To perform semi-supervised training, you just need to add the `--subjects-unlabeled` argument. In the example below, we use ground-truth 2D poses as input, and train supervised on just 10% of Subject 1 (specified by `--subset 0.1`). The remaining subjects are treated as unlabeled data and are used for semi-supervision. 93 | ``` 94 | python run.py -k gt --subjects-train S1 --subset 0.1 --subjects-unlabeled S5,S6,S7,S8 -e 200 -lrd 0.98 -arc 3,3,3 --warmup 5 -b 64 95 | ``` 96 | This should give you an error around 65.2 mm. By contrast, if we only train supervised 97 | ``` 98 | python run.py -k gt --subjects-train S1 --subset 0.1 -e 200 -lrd 0.98 -arc 3,3,3 -b 64 99 | ``` 100 | we get around 80.7 mm, which is significantly higher. 101 | 102 | ### Visualization 103 | If you have the original Human3.6M videos, you can generate nice visualizations of the model predictions. For instance: 104 | ``` 105 | python run.py -k cpn_ft_h36m_dbb -arc 3,3,3,3,3 -c checkpoint --evaluate pretrained_h36m_cpn.bin --render --viz-subject S11 --viz-action Walking --viz-camera 0 --viz-video "/path/to/videos/S11/Videos/Walking.54138969.mp4" --viz-output output.gif --viz-size 3 --viz-downsample 2 --viz-limit 60 106 | ``` 107 | The script can also export MP4 videos, and supports a variety of parameters (e.g. downsampling/FPS, size, bitrate). See [`DOCUMENTATION.md`](DOCUMENTATION.md) for more details. 108 | 109 | ## License 110 | This work is licensed under CC BY-NC. See LICENSE for details. Third-party datasets are subject to their respective licenses. 
111 | If you use our code/models in your research, please cite our paper: 112 | ``` 113 | @inproceedings{pavllo:videopose3d:2019, 114 | title={3D human pose estimation in video with temporal convolutions and semi-supervised training}, 115 | author={Pavllo, Dario and Feichtenhofer, Christoph and Grangier, David and Auli, Michael}, 116 | booktitle={Conference on Computer Vision and Pattern Recognition (CVPR)}, 117 | year={2019} 118 | } 119 | ``` 120 | -------------------------------------------------------------------------------- /common/arguments.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | import argparse 9 | 10 | def parse_args(): 11 | parser = argparse.ArgumentParser(description='Training script') 12 | 13 | # General arguments 14 | parser.add_argument('-d', '--dataset', default='h36m', type=str, metavar='NAME', help='target dataset') # h36m or humaneva 15 | parser.add_argument('-k', '--keypoints', default='cpn_ft_h36m_dbb', type=str, metavar='NAME', help='2D detections to use') 16 | parser.add_argument('-str', '--subjects-train', default='S1,S5,S6,S7,S8', type=str, metavar='LIST', 17 | help='training subjects separated by comma') 18 | parser.add_argument('-ste', '--subjects-test', default='S9,S11', type=str, metavar='LIST', help='test subjects separated by comma') 19 | parser.add_argument('-sun', '--subjects-unlabeled', default='', type=str, metavar='LIST', 20 | help='unlabeled subjects separated by comma for self-supervision') 21 | parser.add_argument('-a', '--actions', default='*', type=str, metavar='LIST', 22 | help='actions to train/test on, separated by comma, or * for all') 23 | parser.add_argument('-c', '--checkpoint', default='checkpoint', type=str, metavar='PATH', 24 | help='checkpoint directory') 25 | parser.add_argument('--checkpoint-frequency', default=10, type=int, metavar='N', 26 | help='create a checkpoint every N epochs') 27 | parser.add_argument('-r', '--resume', default='', type=str, metavar='FILENAME', 28 | help='checkpoint to resume (file name)') 29 | parser.add_argument('--evaluate', default='', type=str, metavar='FILENAME', help='checkpoint to evaluate (file name)') 30 | parser.add_argument('--render', action='store_true', help='visualize a particular video') 31 | parser.add_argument('--by-subject', action='store_true', help='break down error by subject (on evaluation)') 32 | parser.add_argument('--export-training-curves', action='store_true', help='save training curves as .png images') 33 | 34 | # Model arguments 35 | parser.add_argument('-s', '--stride', default=1, type=int, metavar='N', help='chunk size to use during training') 36 | parser.add_argument('-e', '--epochs', default=60, type=int, metavar='N', help='number of training epochs') 37 | parser.add_argument('-b', '--batch-size', default=1024, type=int, metavar='N', help='batch size in terms of predicted frames') 38 | parser.add_argument('-drop', '--dropout', default=0.25, type=float, metavar='P', help='dropout probability') 39 | parser.add_argument('-lr', '--learning-rate', default=0.001, type=float, metavar='LR', help='initial learning rate') 40 | parser.add_argument('-lrd', '--lr-decay', default=0.95, type=float, metavar='LR', help='learning rate decay per epoch') 41 | parser.add_argument('-no-da', '--no-data-augmentation', 
dest='data_augmentation', action='store_false', 42 | help='disable train-time flipping') 43 | parser.add_argument('-no-tta', '--no-test-time-augmentation', dest='test_time_augmentation', action='store_false', 44 | help='disable test-time flipping') 45 | parser.add_argument('-arc', '--architecture', default='3,3,3', type=str, metavar='LAYERS', help='filter widths separated by comma') 46 | parser.add_argument('--causal', action='store_true', help='use causal convolutions for real-time processing') 47 | parser.add_argument('-ch', '--channels', default=1024, type=int, metavar='N', help='number of channels in convolution layers') 48 | 49 | # Experimental 50 | parser.add_argument('--subset', default=1, type=float, metavar='FRACTION', help='reduce dataset size by fraction') 51 | parser.add_argument('--downsample', default=1, type=int, metavar='FACTOR', help='downsample frame rate by factor (semi-supervised)') 52 | parser.add_argument('--warmup', default=1, type=int, metavar='N', help='warm-up epochs for semi-supervision') 53 | parser.add_argument('--no-eval', action='store_true', help='disable epoch evaluation while training (small speed-up)') 54 | parser.add_argument('--dense', action='store_true', help='use dense convolutions instead of dilated convolutions') 55 | parser.add_argument('--disable-optimizations', action='store_true', help='disable optimized model for single-frame predictions') 56 | parser.add_argument('--linear-projection', action='store_true', help='use only linear coefficients for semi-supervised projection') 57 | parser.add_argument('--no-bone-length', action='store_false', dest='bone_length_term', 58 | help='disable bone length term in semi-supervised settings') 59 | parser.add_argument('--no-proj', action='store_true', help='disable projection for semi-supervised setting') 60 | 61 | # Visualization 62 | parser.add_argument('--viz-subject', type=str, metavar='STR', help='subject to render') 63 | parser.add_argument('--viz-action', type=str, metavar='STR', help='action to render') 64 | parser.add_argument('--viz-camera', type=int, default=0, metavar='N', help='camera to render') 65 | parser.add_argument('--viz-video', type=str, metavar='PATH', help='path to input video') 66 | parser.add_argument('--viz-skip', type=int, default=0, metavar='N', help='skip first N frames of input video') 67 | parser.add_argument('--viz-output', type=str, metavar='PATH', help='output file name (.gif or .mp4)') 68 | parser.add_argument('--viz-export', type=str, metavar='PATH', help='output file name for coordinates') 69 | parser.add_argument('--viz-bitrate', type=int, default=3000, metavar='N', help='bitrate for mp4 videos') 70 | parser.add_argument('--viz-no-ground-truth', action='store_true', help='do not show ground-truth poses') 71 | parser.add_argument('--viz-limit', type=int, default=-1, metavar='N', help='only render first N frames') 72 | parser.add_argument('--viz-downsample', type=int, default=1, metavar='N', help='downsample FPS by a factor N') 73 | parser.add_argument('--viz-size', type=int, default=5, metavar='N', help='image size') 74 | 75 | parser.set_defaults(bone_length_term=True) 76 | parser.set_defaults(data_augmentation=True) 77 | parser.set_defaults(test_time_augmentation=True) 78 | 79 | args = parser.parse_args() 80 | # Check invalid configuration 81 | if args.resume and args.evaluate: 82 | print('Invalid flags: --resume and --evaluate cannot be set at the same time') 83 | exit() 84 | 85 | if args.export_training_curves and args.no_eval: 86 | print('Invalid flags: 
--export-training-curves and --no-eval cannot be set at the same time') 87 | exit() 88 | 89 | return args -------------------------------------------------------------------------------- /common/camera.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | import numpy as np 9 | import torch 10 | 11 | from common.utils import wrap 12 | from common.quaternion import qrot, qinverse 13 | 14 | def normalize_screen_coordinates(X, w, h): 15 | assert X.shape[-1] == 2 16 | 17 | # Normalize so that [0, w] is mapped to [-1, 1], while preserving the aspect ratio 18 | return X/w*2 - [1, h/w] 19 | 20 | 21 | def image_coordinates(X, w, h): 22 | assert X.shape[-1] == 2 23 | 24 | # Reverse camera frame normalization 25 | return (X + [1, h/w])*w/2 26 | 27 | 28 | def world_to_camera(X, R, t): 29 | Rt = wrap(qinverse, R) # Invert rotation 30 | return wrap(qrot, np.tile(Rt, (*X.shape[:-1], 1)), X - t) # Rotate and translate 31 | 32 | 33 | def camera_to_world(X, R, t): 34 | return wrap(qrot, np.tile(R, (*X.shape[:-1], 1)), X) + t 35 | 36 | 37 | def project_to_2d(X, camera_params): 38 | """ 39 | Project 3D points to 2D using the Human3.6M camera projection function. 40 | This is a differentiable and batched reimplementation of the original MATLAB script. 41 | 42 | Arguments: 43 | X -- 3D points in *camera space* to transform (N, *, 3) 44 | camera_params -- intrinsic parameteres (N, 2+2+3+2=9) 45 | """ 46 | assert X.shape[-1] == 3 47 | assert len(camera_params.shape) == 2 48 | assert camera_params.shape[-1] == 9 49 | assert X.shape[0] == camera_params.shape[0] 50 | 51 | while len(camera_params.shape) < len(X.shape): 52 | camera_params = camera_params.unsqueeze(1) 53 | 54 | f = camera_params[..., :2] 55 | c = camera_params[..., 2:4] 56 | k = camera_params[..., 4:7] 57 | p = camera_params[..., 7:] 58 | 59 | XX = torch.clamp(X[..., :2] / X[..., 2:], min=-1, max=1) 60 | r2 = torch.sum(XX[..., :2]**2, dim=len(XX.shape)-1, keepdim=True) 61 | 62 | radial = 1 + torch.sum(k * torch.cat((r2, r2**2, r2**3), dim=len(r2.shape)-1), dim=len(r2.shape)-1, keepdim=True) 63 | tan = torch.sum(p*XX, dim=len(XX.shape)-1, keepdim=True) 64 | 65 | XXX = XX*(radial + tan) + p*r2 66 | 67 | return f*XXX + c 68 | 69 | def project_to_2d_linear(X, camera_params): 70 | """ 71 | Project 3D points to 2D using only linear parameters (focal length and principal point). 72 | 73 | Arguments: 74 | X -- 3D points in *camera space* to transform (N, *, 3) 75 | camera_params -- intrinsic parameteres (N, 2+2+3+2=9) 76 | """ 77 | assert X.shape[-1] == 3 78 | assert len(camera_params.shape) == 2 79 | assert camera_params.shape[-1] == 9 80 | assert X.shape[0] == camera_params.shape[0] 81 | 82 | while len(camera_params.shape) < len(X.shape): 83 | camera_params = camera_params.unsqueeze(1) 84 | 85 | f = camera_params[..., :2] 86 | c = camera_params[..., 2:4] 87 | 88 | XX = torch.clamp(X[..., :2] / X[..., 2:], min=-1, max=1) 89 | 90 | return f*XX + c -------------------------------------------------------------------------------- /common/custom_dataset.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 
3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | import numpy as np 9 | import copy 10 | from common.skeleton import Skeleton 11 | from common.mocap_dataset import MocapDataset 12 | from common.camera import normalize_screen_coordinates, image_coordinates 13 | from common.h36m_dataset import h36m_skeleton 14 | 15 | 16 | custom_camera_params = { 17 | 'id': None, 18 | 'res_w': None, # Pulled from metadata 19 | 'res_h': None, # Pulled from metadata 20 | 21 | # Dummy camera parameters (taken from Human3.6M), only for visualization purposes 22 | 'azimuth': 70, # Only used for visualization 23 | 'orientation': [0.1407056450843811, -0.1500701755285263, -0.755240797996521, 0.6223280429840088], 24 | 'translation': [1841.1070556640625, 4955.28466796875, 1563.4454345703125], 25 | } 26 | 27 | class CustomDataset(MocapDataset): 28 | def __init__(self, detections_path, remove_static_joints=True): 29 | super().__init__(fps=None, skeleton=h36m_skeleton) 30 | 31 | # Load serialized dataset 32 | data = np.load(detections_path, allow_pickle=True) 33 | resolutions = data['metadata'].item()['video_metadata'] 34 | 35 | self._cameras = {} 36 | self._data = {} 37 | for video_name, res in resolutions.items(): 38 | cam = {} 39 | cam.update(custom_camera_params) 40 | cam['orientation'] = np.array(cam['orientation'], dtype='float32') 41 | cam['translation'] = np.array(cam['translation'], dtype='float32') 42 | cam['translation'] = cam['translation']/1000 # mm to meters 43 | 44 | cam['id'] = video_name 45 | cam['res_w'] = res['w'] 46 | cam['res_h'] = res['h'] 47 | 48 | self._cameras[video_name] = [cam] 49 | 50 | self._data[video_name] = { 51 | 'custom': { 52 | 'cameras': cam 53 | } 54 | } 55 | 56 | if remove_static_joints: 57 | # Bring the skeleton to 17 joints instead of the original 32 58 | self.remove_joints([4, 5, 9, 10, 11, 16, 20, 21, 22, 23, 24, 28, 29, 30, 31]) 59 | 60 | # Rewire shoulders to the correct parents 61 | self._skeleton._parents[11] = 8 62 | self._skeleton._parents[14] = 8 63 | 64 | def supports_semi_supervised(self): 65 | return False 66 | -------------------------------------------------------------------------------- /common/generators.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | from itertools import zip_longest 9 | import numpy as np 10 | 11 | class ChunkedGenerator: 12 | """ 13 | Batched data generator, used for training. 14 | The sequences are split into equal-length chunks and padded as necessary. 
15 | 16 | Arguments: 17 | batch_size -- the batch size to use for training 18 | cameras -- list of cameras, one element for each video (optional, used for semi-supervised training) 19 | poses_3d -- list of ground-truth 3D poses, one element for each video (optional, used for supervised training) 20 | poses_2d -- list of input 2D keypoints, one element for each video 21 | chunk_length -- number of output frames to predict for each training example (usually 1) 22 | pad -- 2D input padding to compensate for valid convolutions, per side (depends on the receptive field) 23 | causal_shift -- asymmetric padding offset when causal convolutions are used (usually 0 or "pad") 24 | shuffle -- randomly shuffle the dataset before each epoch 25 | random_seed -- initial seed to use for the random generator 26 | augment -- augment the dataset by flipping poses horizontally 27 | kps_left and kps_right -- list of left/right 2D keypoints if flipping is enabled 28 | joints_left and joints_right -- list of left/right 3D joints if flipping is enabled 29 | """ 30 | def __init__(self, batch_size, cameras, poses_3d, poses_2d, 31 | chunk_length, pad=0, causal_shift=0, 32 | shuffle=True, random_seed=1234, 33 | augment=False, kps_left=None, kps_right=None, joints_left=None, joints_right=None, 34 | endless=False): 35 | assert poses_3d is None or len(poses_3d) == len(poses_2d), (len(poses_3d), len(poses_2d)) 36 | assert cameras is None or len(cameras) == len(poses_2d) 37 | 38 | # Build lineage info 39 | pairs = [] # (seq_idx, start_frame, end_frame, flip) tuples 40 | for i in range(len(poses_2d)): 41 | assert poses_3d is None or poses_3d[i].shape[0] == poses_3d[i].shape[0] 42 | n_chunks = (poses_2d[i].shape[0] + chunk_length - 1) // chunk_length 43 | offset = (n_chunks * chunk_length - poses_2d[i].shape[0]) // 2 44 | bounds = np.arange(n_chunks+1)*chunk_length - offset 45 | augment_vector = np.full(len(bounds - 1), False, dtype=bool) 46 | pairs += zip(np.repeat(i, len(bounds - 1)), bounds[:-1], bounds[1:], augment_vector) 47 | if augment: 48 | pairs += zip(np.repeat(i, len(bounds - 1)), bounds[:-1], bounds[1:], ~augment_vector) 49 | 50 | # Initialize buffers 51 | if cameras is not None: 52 | self.batch_cam = np.empty((batch_size, cameras[0].shape[-1])) 53 | if poses_3d is not None: 54 | self.batch_3d = np.empty((batch_size, chunk_length, poses_3d[0].shape[-2], poses_3d[0].shape[-1])) 55 | self.batch_2d = np.empty((batch_size, chunk_length + 2*pad, poses_2d[0].shape[-2], poses_2d[0].shape[-1])) 56 | 57 | self.num_batches = (len(pairs) + batch_size - 1) // batch_size 58 | self.batch_size = batch_size 59 | self.random = np.random.RandomState(random_seed) 60 | self.pairs = pairs 61 | self.shuffle = shuffle 62 | self.pad = pad 63 | self.causal_shift = causal_shift 64 | self.endless = endless 65 | self.state = None 66 | 67 | self.cameras = cameras 68 | self.poses_3d = poses_3d 69 | self.poses_2d = poses_2d 70 | 71 | self.augment = augment 72 | self.kps_left = kps_left 73 | self.kps_right = kps_right 74 | self.joints_left = joints_left 75 | self.joints_right = joints_right 76 | 77 | def num_frames(self): 78 | return self.num_batches * self.batch_size 79 | 80 | def random_state(self): 81 | return self.random 82 | 83 | def set_random_state(self, random): 84 | self.random = random 85 | 86 | def augment_enabled(self): 87 | return self.augment 88 | 89 | def next_pairs(self): 90 | if self.state is None: 91 | if self.shuffle: 92 | pairs = self.random.permutation(self.pairs) 93 | else: 94 | pairs = self.pairs 95 | return 0, pairs 96 | 
else: 97 | return self.state 98 | 99 | def next_epoch(self): 100 | enabled = True 101 | while enabled: 102 | start_idx, pairs = self.next_pairs() 103 | for b_i in range(start_idx, self.num_batches): 104 | chunks = pairs[b_i*self.batch_size : (b_i+1)*self.batch_size] 105 | for i, (seq_i, start_3d, end_3d, flip) in enumerate(chunks): 106 | start_2d = start_3d - self.pad - self.causal_shift 107 | end_2d = end_3d + self.pad - self.causal_shift 108 | 109 | # 2D poses 110 | seq_2d = self.poses_2d[seq_i] 111 | low_2d = max(start_2d, 0) 112 | high_2d = min(end_2d, seq_2d.shape[0]) 113 | pad_left_2d = low_2d - start_2d 114 | pad_right_2d = end_2d - high_2d 115 | if pad_left_2d != 0 or pad_right_2d != 0: 116 | self.batch_2d[i] = np.pad(seq_2d[low_2d:high_2d], ((pad_left_2d, pad_right_2d), (0, 0), (0, 0)), 'edge') 117 | else: 118 | self.batch_2d[i] = seq_2d[low_2d:high_2d] 119 | 120 | if flip: 121 | # Flip 2D keypoints 122 | self.batch_2d[i, :, :, 0] *= -1 123 | self.batch_2d[i, :, self.kps_left + self.kps_right] = self.batch_2d[i, :, self.kps_right + self.kps_left] 124 | 125 | # 3D poses 126 | if self.poses_3d is not None: 127 | seq_3d = self.poses_3d[seq_i] 128 | low_3d = max(start_3d, 0) 129 | high_3d = min(end_3d, seq_3d.shape[0]) 130 | pad_left_3d = low_3d - start_3d 131 | pad_right_3d = end_3d - high_3d 132 | if pad_left_3d != 0 or pad_right_3d != 0: 133 | self.batch_3d[i] = np.pad(seq_3d[low_3d:high_3d], ((pad_left_3d, pad_right_3d), (0, 0), (0, 0)), 'edge') 134 | else: 135 | self.batch_3d[i] = seq_3d[low_3d:high_3d] 136 | 137 | if flip: 138 | # Flip 3D joints 139 | self.batch_3d[i, :, :, 0] *= -1 140 | self.batch_3d[i, :, self.joints_left + self.joints_right] = \ 141 | self.batch_3d[i, :, self.joints_right + self.joints_left] 142 | 143 | # Cameras 144 | if self.cameras is not None: 145 | self.batch_cam[i] = self.cameras[seq_i] 146 | if flip: 147 | # Flip horizontal distortion coefficients 148 | self.batch_cam[i, 2] *= -1 149 | self.batch_cam[i, 7] *= -1 150 | 151 | if self.endless: 152 | self.state = (b_i + 1, pairs) 153 | if self.poses_3d is None and self.cameras is None: 154 | yield None, None, self.batch_2d[:len(chunks)] 155 | elif self.poses_3d is not None and self.cameras is None: 156 | yield None, self.batch_3d[:len(chunks)], self.batch_2d[:len(chunks)] 157 | elif self.poses_3d is None: 158 | yield self.batch_cam[:len(chunks)], None, self.batch_2d[:len(chunks)] 159 | else: 160 | yield self.batch_cam[:len(chunks)], self.batch_3d[:len(chunks)], self.batch_2d[:len(chunks)] 161 | 162 | if self.endless: 163 | self.state = None 164 | else: 165 | enabled = False 166 | 167 | 168 | class UnchunkedGenerator: 169 | """ 170 | Non-batched data generator, used for testing. 171 | Sequences are returned one at a time (i.e. batch size = 1), without chunking. 172 | 173 | If data augmentation is enabled, the batches contain two sequences (i.e. batch size = 2), 174 | the second of which is a mirrored version of the first. 
175 | 176 | Arguments: 177 | cameras -- list of cameras, one element for each video (optional, used for semi-supervised training) 178 | poses_3d -- list of ground-truth 3D poses, one element for each video (optional, used for supervised training) 179 | poses_2d -- list of input 2D keypoints, one element for each video 180 | pad -- 2D input padding to compensate for valid convolutions, per side (depends on the receptive field) 181 | causal_shift -- asymmetric padding offset when causal convolutions are used (usually 0 or "pad") 182 | augment -- augment the dataset by flipping poses horizontally 183 | kps_left and kps_right -- list of left/right 2D keypoints if flipping is enabled 184 | joints_left and joints_right -- list of left/right 3D joints if flipping is enabled 185 | """ 186 | 187 | def __init__(self, cameras, poses_3d, poses_2d, pad=0, causal_shift=0, 188 | augment=False, kps_left=None, kps_right=None, joints_left=None, joints_right=None): 189 | assert poses_3d is None or len(poses_3d) == len(poses_2d) 190 | assert cameras is None or len(cameras) == len(poses_2d) 191 | 192 | self.augment = augment 193 | self.kps_left = kps_left 194 | self.kps_right = kps_right 195 | self.joints_left = joints_left 196 | self.joints_right = joints_right 197 | 198 | self.pad = pad 199 | self.causal_shift = causal_shift 200 | self.cameras = [] if cameras is None else cameras 201 | self.poses_3d = [] if poses_3d is None else poses_3d 202 | self.poses_2d = poses_2d 203 | 204 | def num_frames(self): 205 | count = 0 206 | for p in self.poses_2d: 207 | count += p.shape[0] 208 | return count 209 | 210 | def augment_enabled(self): 211 | return self.augment 212 | 213 | def set_augment(self, augment): 214 | self.augment = augment 215 | 216 | def next_epoch(self): 217 | for seq_cam, seq_3d, seq_2d in zip_longest(self.cameras, self.poses_3d, self.poses_2d): 218 | batch_cam = None if seq_cam is None else np.expand_dims(seq_cam, axis=0) 219 | batch_3d = None if seq_3d is None else np.expand_dims(seq_3d, axis=0) 220 | batch_2d = np.expand_dims(np.pad(seq_2d, 221 | ((self.pad + self.causal_shift, self.pad - self.causal_shift), (0, 0), (0, 0)), 222 | 'edge'), axis=0) 223 | if self.augment: 224 | # Append flipped version 225 | if batch_cam is not None: 226 | batch_cam = np.concatenate((batch_cam, batch_cam), axis=0) 227 | batch_cam[1, 2] *= -1 228 | batch_cam[1, 7] *= -1 229 | 230 | if batch_3d is not None: 231 | batch_3d = np.concatenate((batch_3d, batch_3d), axis=0) 232 | batch_3d[1, :, :, 0] *= -1 233 | batch_3d[1, :, self.joints_left + self.joints_right] = batch_3d[1, :, self.joints_right + self.joints_left] 234 | 235 | batch_2d = np.concatenate((batch_2d, batch_2d), axis=0) 236 | batch_2d[1, :, :, 0] *= -1 237 | batch_2d[1, :, self.kps_left + self.kps_right] = batch_2d[1, :, self.kps_right + self.kps_left] 238 | 239 | yield batch_cam, batch_3d, batch_2d -------------------------------------------------------------------------------- /common/h36m_dataset.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 
6 | # 7 | 8 | import numpy as np 9 | import copy 10 | from common.skeleton import Skeleton 11 | from common.mocap_dataset import MocapDataset 12 | from common.camera import normalize_screen_coordinates, image_coordinates 13 | 14 | h36m_skeleton = Skeleton(parents=[-1, 0, 1, 2, 3, 4, 0, 6, 7, 8, 9, 0, 11, 12, 13, 14, 12, 15 | 16, 17, 18, 19, 20, 19, 22, 12, 24, 25, 26, 27, 28, 27, 30], 16 | joints_left=[6, 7, 8, 9, 10, 16, 17, 18, 19, 20, 21, 22, 23], 17 | joints_right=[1, 2, 3, 4, 5, 24, 25, 26, 27, 28, 29, 30, 31]) 18 | 19 | h36m_cameras_intrinsic_params = [ 20 | { 21 | 'id': '54138969', 22 | 'center': [512.54150390625, 515.4514770507812], 23 | 'focal_length': [1145.0494384765625, 1143.7811279296875], 24 | 'radial_distortion': [-0.20709891617298126, 0.24777518212795258, -0.0030751503072679043], 25 | 'tangential_distortion': [-0.0009756988729350269, -0.00142447161488235], 26 | 'res_w': 1000, 27 | 'res_h': 1002, 28 | 'azimuth': 70, # Only used for visualization 29 | }, 30 | { 31 | 'id': '55011271', 32 | 'center': [508.8486328125, 508.0649108886719], 33 | 'focal_length': [1149.6756591796875, 1147.5916748046875], 34 | 'radial_distortion': [-0.1942136287689209, 0.2404085397720337, 0.006819975562393665], 35 | 'tangential_distortion': [-0.0016190266469493508, -0.0027408944442868233], 36 | 'res_w': 1000, 37 | 'res_h': 1000, 38 | 'azimuth': -70, # Only used for visualization 39 | }, 40 | { 41 | 'id': '58860488', 42 | 'center': [519.8158569335938, 501.40264892578125], 43 | 'focal_length': [1149.1407470703125, 1148.7989501953125], 44 | 'radial_distortion': [-0.2083381861448288, 0.25548800826072693, -0.0024604974314570427], 45 | 'tangential_distortion': [0.0014843869721516967, -0.0007599993259645998], 46 | 'res_w': 1000, 47 | 'res_h': 1000, 48 | 'azimuth': 110, # Only used for visualization 49 | }, 50 | { 51 | 'id': '60457274', 52 | 'center': [514.9682006835938, 501.88201904296875], 53 | 'focal_length': [1145.5113525390625, 1144.77392578125], 54 | 'radial_distortion': [-0.198384091258049, 0.21832367777824402, -0.008947807364165783], 55 | 'tangential_distortion': [-0.0005872055771760643, -0.0018133620033040643], 56 | 'res_w': 1000, 57 | 'res_h': 1002, 58 | 'azimuth': -110, # Only used for visualization 59 | }, 60 | ] 61 | 62 | h36m_cameras_extrinsic_params = { 63 | 'S1': [ 64 | { 65 | 'orientation': [0.1407056450843811, -0.1500701755285263, -0.755240797996521, 0.6223280429840088], 66 | 'translation': [1841.1070556640625, 4955.28466796875, 1563.4454345703125], 67 | }, 68 | { 69 | 'orientation': [0.6157187819480896, -0.764836311340332, -0.14833825826644897, 0.11794740706682205], 70 | 'translation': [1761.278564453125, -5078.0068359375, 1606.2650146484375], 71 | }, 72 | { 73 | 'orientation': [0.14651472866535187, -0.14647851884365082, 0.7653023600578308, -0.6094175577163696], 74 | 'translation': [-1846.7777099609375, 5215.04638671875, 1491.972412109375], 75 | }, 76 | { 77 | 'orientation': [0.5834008455276489, -0.7853162288665771, 0.14548823237419128, -0.14749594032764435], 78 | 'translation': [-1794.7896728515625, -3722.698974609375, 1574.8927001953125], 79 | }, 80 | ], 81 | 'S2': [ 82 | {}, 83 | {}, 84 | {}, 85 | {}, 86 | ], 87 | 'S3': [ 88 | {}, 89 | {}, 90 | {}, 91 | {}, 92 | ], 93 | 'S4': [ 94 | {}, 95 | {}, 96 | {}, 97 | {}, 98 | ], 99 | 'S5': [ 100 | { 101 | 'orientation': [0.1467377245426178, -0.162370964884758, -0.7551892995834351, 0.6178938746452332], 102 | 'translation': [2097.3916015625, 4880.94482421875, 1605.732421875], 103 | }, 104 | { 105 | 'orientation': [0.6159758567810059, 
-0.7626792192459106, -0.15728192031383514, 0.1189815029501915], 106 | 'translation': [2031.7008056640625, -5167.93310546875, 1612.923095703125], 107 | }, 108 | { 109 | 'orientation': [0.14291371405124664, -0.12907841801643372, 0.7678384780883789, -0.6110143065452576], 110 | 'translation': [-1620.5948486328125, 5171.65869140625, 1496.43701171875], 111 | }, 112 | { 113 | 'orientation': [0.5920479893684387, -0.7814217805862427, 0.1274748593568802, -0.15036417543888092], 114 | 'translation': [-1637.1737060546875, -3867.3173828125, 1547.033203125], 115 | }, 116 | ], 117 | 'S6': [ 118 | { 119 | 'orientation': [0.1337897777557373, -0.15692396461963654, -0.7571090459823608, 0.6198879480361938], 120 | 'translation': [1935.4517822265625, 4950.24560546875, 1618.0838623046875], 121 | }, 122 | { 123 | 'orientation': [0.6147197484970093, -0.7628812789916992, -0.16174767911434174, 0.11819244921207428], 124 | 'translation': [1969.803955078125, -5128.73876953125, 1632.77880859375], 125 | }, 126 | { 127 | 'orientation': [0.1529948115348816, -0.13529130816459656, 0.7646096348762512, -0.6112781167030334], 128 | 'translation': [-1769.596435546875, 5185.361328125, 1476.993408203125], 129 | }, 130 | { 131 | 'orientation': [0.5916101336479187, -0.7804774045944214, 0.12832270562648773, -0.1561593860387802], 132 | 'translation': [-1721.668701171875, -3884.13134765625, 1540.4879150390625], 133 | }, 134 | ], 135 | 'S7': [ 136 | { 137 | 'orientation': [0.1435241848230362, -0.1631336808204651, -0.7548328638076782, 0.6188824772834778], 138 | 'translation': [1974.512939453125, 4926.3544921875, 1597.8326416015625], 139 | }, 140 | { 141 | 'orientation': [0.6141672730445862, -0.7638262510299683, -0.1596645563840866, 0.1177929937839508], 142 | 'translation': [1937.0584716796875, -5119.7900390625, 1631.5665283203125], 143 | }, 144 | { 145 | 'orientation': [0.14550060033798218, -0.12874816358089447, 0.7660516500473022, -0.6127139329910278], 146 | 'translation': [-1741.8111572265625, 5208.24951171875, 1464.8245849609375], 147 | }, 148 | { 149 | 'orientation': [0.5912848114967346, -0.7821764349937439, 0.12445473670959473, -0.15196487307548523], 150 | 'translation': [-1734.7105712890625, -3832.42138671875, 1548.5830078125], 151 | }, 152 | ], 153 | 'S8': [ 154 | { 155 | 'orientation': [0.14110587537288666, -0.15589867532253265, -0.7561917304992676, 0.619644045829773], 156 | 'translation': [2150.65185546875, 4896.1611328125, 1611.9046630859375], 157 | }, 158 | { 159 | 'orientation': [0.6169601678848267, -0.7647668123245239, -0.14846350252628326, 0.11158157885074615], 160 | 'translation': [2219.965576171875, -5148.453125, 1613.0440673828125], 161 | }, 162 | { 163 | 'orientation': [0.1471444070339203, -0.13377119600772858, 0.7670128345489502, -0.6100369691848755], 164 | 'translation': [-1571.2215576171875, 5137.0185546875, 1498.1761474609375], 165 | }, 166 | { 167 | 'orientation': [0.5927824378013611, -0.7825870513916016, 0.12147816270589828, -0.14631995558738708], 168 | 'translation': [-1476.913330078125, -3896.7412109375, 1547.97216796875], 169 | }, 170 | ], 171 | 'S9': [ 172 | { 173 | 'orientation': [0.15540587902069092, -0.15548215806484222, -0.7532095313072205, 0.6199594736099243], 174 | 'translation': [2044.45849609375, 4935.1171875, 1481.2275390625], 175 | }, 176 | { 177 | 'orientation': [0.618784487247467, -0.7634735107421875, -0.14132238924503326, 0.11933968216180801], 178 | 'translation': [1990.959716796875, -5123.810546875, 1568.8048095703125], 179 | }, 180 | { 181 | 'orientation': [0.13357827067375183, 
-0.1367100477218628, 0.7689454555511475, -0.6100738644599915], 182 | 'translation': [-1670.9921875, 5211.98583984375, 1528.387939453125], 183 | }, 184 | { 185 | 'orientation': [0.5879399180412292, -0.7823407053947449, 0.1427614390850067, -0.14794869720935822], 186 | 'translation': [-1696.04345703125, -3827.099853515625, 1591.4127197265625], 187 | }, 188 | ], 189 | 'S11': [ 190 | { 191 | 'orientation': [0.15232472121715546, -0.15442320704460144, -0.7547563314437866, 0.6191070079803467], 192 | 'translation': [2098.440185546875, 4926.5546875, 1500.278564453125], 193 | }, 194 | { 195 | 'orientation': [0.6189449429512024, -0.7600917220115662, -0.15300633013248444, 0.1255258321762085], 196 | 'translation': [2083.182373046875, -4912.1728515625, 1561.07861328125], 197 | }, 198 | { 199 | 'orientation': [0.14943228662014008, -0.15650227665901184, 0.7681233882904053, -0.6026304364204407], 200 | 'translation': [-1609.8153076171875, 5177.3359375, 1537.896728515625], 201 | }, 202 | { 203 | 'orientation': [0.5894251465797424, -0.7818877100944519, 0.13991211354732513, -0.14715361595153809], 204 | 'translation': [-1590.738037109375, -3854.1689453125, 1578.017578125], 205 | }, 206 | ], 207 | } 208 | 209 | class Human36mDataset(MocapDataset): 210 | def __init__(self, path, remove_static_joints=True): 211 | super().__init__(fps=50, skeleton=h36m_skeleton) 212 | 213 | self._cameras = copy.deepcopy(h36m_cameras_extrinsic_params) 214 | for cameras in self._cameras.values(): 215 | for i, cam in enumerate(cameras): 216 | cam.update(h36m_cameras_intrinsic_params[i]) 217 | for k, v in cam.items(): 218 | if k not in ['id', 'res_w', 'res_h']: 219 | cam[k] = np.array(v, dtype='float32') 220 | 221 | # Normalize camera frame 222 | cam['center'] = normalize_screen_coordinates(cam['center'], w=cam['res_w'], h=cam['res_h']).astype('float32') 223 | cam['focal_length'] = cam['focal_length']/cam['res_w']*2 224 | if 'translation' in cam: 225 | cam['translation'] = cam['translation']/1000 # mm to meters 226 | 227 | # Add intrinsic parameters vector 228 | cam['intrinsic'] = np.concatenate((cam['focal_length'], 229 | cam['center'], 230 | cam['radial_distortion'], 231 | cam['tangential_distortion'])) 232 | 233 | # Load serialized dataset 234 | data = np.load(path, allow_pickle=True)['positions_3d'].item() 235 | 236 | self._data = {} 237 | for subject, actions in data.items(): 238 | self._data[subject] = {} 239 | for action_name, positions in actions.items(): 240 | self._data[subject][action_name] = { 241 | 'positions': positions, 242 | 'cameras': self._cameras[subject], 243 | } 244 | 245 | if remove_static_joints: 246 | # Bring the skeleton to 17 joints instead of the original 32 247 | self.remove_joints([4, 5, 9, 10, 11, 16, 20, 21, 22, 23, 24, 28, 29, 30, 31]) 248 | 249 | # Rewire shoulders to the correct parents 250 | self._skeleton._parents[11] = 8 251 | self._skeleton._parents[14] = 8 252 | 253 | def supports_semi_supervised(self): 254 | return True 255 | -------------------------------------------------------------------------------- /common/humaneva_dataset.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 
6 | # 7 | 8 | import numpy as np 9 | import copy 10 | from common.skeleton import Skeleton 11 | from common.mocap_dataset import MocapDataset 12 | from common.camera import normalize_screen_coordinates, image_coordinates 13 | 14 | humaneva_skeleton = Skeleton(parents=[-1, 0, 1, 2, 3, 1, 5, 6, 0, 8, 9, 0, 11, 12, 1], 15 | joints_left=[2, 3, 4, 8, 9, 10], 16 | joints_right=[5, 6, 7, 11, 12, 13]) 17 | 18 | humaneva_cameras_intrinsic_params = [ 19 | { 20 | 'id': 'C1', 21 | 'res_w': 640, 22 | 'res_h': 480, 23 | 'azimuth': 0, # Only used for visualization 24 | }, 25 | { 26 | 'id': 'C2', 27 | 'res_w': 640, 28 | 'res_h': 480, 29 | 'azimuth': -90, # Only used for visualization 30 | }, 31 | { 32 | 'id': 'C3', 33 | 'res_w': 640, 34 | 'res_h': 480, 35 | 'azimuth': 90, # Only used for visualization 36 | }, 37 | ] 38 | 39 | humaneva_cameras_extrinsic_params = { 40 | 'S1': [ 41 | { 42 | 'orientation': [0.424207, -0.4983646, -0.5802981, 0.4847012], 43 | 'translation': [4062.227, 663.2477, 1528.397], 44 | }, 45 | { 46 | 'orientation': [0.6503354, -0.7481602, -0.0919284, 0.0941766], 47 | 'translation': [844.8131, -3805.2092, 1504.9929], 48 | }, 49 | { 50 | 'orientation': [0.0664734, -0.0690535, 0.7416416, -0.6639132], 51 | 'translation': [-797.67377, 3916.3174, 1433.6602], 52 | }, 53 | ], 54 | 'S2': [ 55 | { 56 | 'orientation': [ 0.4214752, -0.4961493, -0.5838273, 0.4851187 ], 57 | 'translation': [ 4112.9121, 626.4929, 1545.2988], 58 | }, 59 | { 60 | 'orientation': [ 0.6501393, -0.7476588, -0.0954617, 0.0959808 ], 61 | 'translation': [ 923.5740, -3877.9243, 1504.5518], 62 | }, 63 | { 64 | 'orientation': [ 0.0699353, -0.0712403, 0.7421637, -0.662742 ], 65 | 'translation': [ -781.4915, 3838.8853, 1444.9929], 66 | }, 67 | ], 68 | 'S3': [ 69 | { 70 | 'orientation': [ 0.424207, -0.4983646, -0.5802981, 0.4847012 ], 71 | 'translation': [ 4062.2271, 663.2477, 1528.3970], 72 | }, 73 | { 74 | 'orientation': [ 0.6503354, -0.7481602, -0.0919284, 0.0941766 ], 75 | 'translation': [ 844.8131, -3805.2092, 1504.9929], 76 | }, 77 | { 78 | 'orientation': [ 0.0664734, -0.0690535, 0.7416416, -0.6639132 ], 79 | 'translation': [ -797.6738, 3916.3174, 1433.6602], 80 | }, 81 | ], 82 | 'S4': [ 83 | {}, 84 | {}, 85 | {}, 86 | ], 87 | 88 | } 89 | 90 | class HumanEvaDataset(MocapDataset): 91 | def __init__(self, path): 92 | super().__init__(fps=60, skeleton=humaneva_skeleton) 93 | 94 | self._cameras = copy.deepcopy(humaneva_cameras_extrinsic_params) 95 | for cameras in self._cameras.values(): 96 | for i, cam in enumerate(cameras): 97 | cam.update(humaneva_cameras_intrinsic_params[i]) 98 | for k, v in cam.items(): 99 | if k not in ['id', 'res_w', 'res_h']: 100 | cam[k] = np.array(v, dtype='float32') 101 | if 'translation' in cam: 102 | cam['translation'] = cam['translation']/1000 # mm to meters 103 | 104 | for subject in list(self._cameras.keys()): 105 | data = self._cameras[subject] 106 | del self._cameras[subject] 107 | for prefix in ['Train/', 'Validate/', 'Unlabeled/Train/', 'Unlabeled/Validate/', 'Unlabeled/']: 108 | self._cameras[prefix + subject] = data 109 | 110 | # Load serialized dataset 111 | data = np.load(path, allow_pickle=True)['positions_3d'].item() 112 | 113 | self._data = {} 114 | for subject, actions in data.items(): 115 | self._data[subject] = {} 116 | for action_name, positions in actions.items(): 117 | self._data[subject][action_name] = { 118 | 'positions': positions, 119 | 'cameras': self._cameras[subject], 120 | } 121 | -------------------------------------------------------------------------------- 
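The two dataset classes above (`Human36mDataset` and `HumanEvaDataset`) share the `MocapDataset` interface, so the generators and `run.py` can treat them uniformly. Below is a minimal sketch of how such a dataset is typically instantiated, assuming the preprocessed archive `data/data_3d_h36m.npz` from `DATASETS.md` is already in place; the action name in the last line is purely illustrative.

```python
from common.h36m_dataset import Human36mDataset

# Assumes data/data_3d_h36m.npz was generated as described in DATASETS.md
dataset = Human36mDataset('data/data_3d_h36m.npz')

print(dataset.fps())                    # 50
print(dataset.skeleton().num_joints())  # 17, after the static joints are removed
print(sorted(dataset.subjects()))       # ['S1', 'S11', 'S5', 'S6', 'S7', 'S8', 'S9']
print(dataset['S1']['Walking 1']['positions'].shape)  # (num_frames, 17, 3); action name is hypothetical
```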
/common/loss.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | import torch 9 | import numpy as np 10 | 11 | def mpjpe(predicted, target): 12 | """ 13 | Mean per-joint position error (i.e. mean Euclidean distance), 14 | often referred to as "Protocol #1" in many papers. 15 | """ 16 | assert predicted.shape == target.shape 17 | return torch.mean(torch.norm(predicted - target, dim=len(target.shape)-1)) 18 | 19 | def weighted_mpjpe(predicted, target, w): 20 | """ 21 | Weighted mean per-joint position error (i.e. mean Euclidean distance) 22 | """ 23 | assert predicted.shape == target.shape 24 | assert w.shape[0] == predicted.shape[0] 25 | return torch.mean(w * torch.norm(predicted - target, dim=len(target.shape)-1)) 26 | 27 | def p_mpjpe(predicted, target): 28 | """ 29 | Pose error: MPJPE after rigid alignment (scale, rotation, and translation), 30 | often referred to as "Protocol #2" in many papers. 31 | """ 32 | assert predicted.shape == target.shape 33 | 34 | muX = np.mean(target, axis=1, keepdims=True) 35 | muY = np.mean(predicted, axis=1, keepdims=True) 36 | 37 | X0 = target - muX 38 | Y0 = predicted - muY 39 | 40 | normX = np.sqrt(np.sum(X0**2, axis=(1, 2), keepdims=True)) 41 | normY = np.sqrt(np.sum(Y0**2, axis=(1, 2), keepdims=True)) 42 | 43 | X0 /= normX 44 | Y0 /= normY 45 | 46 | H = np.matmul(X0.transpose(0, 2, 1), Y0) 47 | U, s, Vt = np.linalg.svd(H) 48 | V = Vt.transpose(0, 2, 1) 49 | R = np.matmul(V, U.transpose(0, 2, 1)) 50 | 51 | # Avoid improper rotations (reflections), i.e. rotations with det(R) = -1 52 | sign_detR = np.sign(np.expand_dims(np.linalg.det(R), axis=1)) 53 | V[:, :, -1] *= sign_detR 54 | s[:, -1] *= sign_detR.flatten() 55 | R = np.matmul(V, U.transpose(0, 2, 1)) # Rotation 56 | 57 | tr = np.expand_dims(np.sum(s, axis=1, keepdims=True), axis=2) 58 | 59 | a = tr * normX / normY # Scale 60 | t = muX - a*np.matmul(muY, R) # Translation 61 | 62 | # Perform rigid transformation on the input 63 | predicted_aligned = a*np.matmul(predicted, R) + t 64 | 65 | # Return MPJPE 66 | return np.mean(np.linalg.norm(predicted_aligned - target, axis=len(target.shape)-1)) 67 | 68 | def n_mpjpe(predicted, target): 69 | """ 70 | Normalized MPJPE (scale only), adapted from: 71 | https://github.com/hrhodin/UnsupervisedGeometryAwareRepresentationLearning/blob/master/losses/poses.py 72 | """ 73 | assert predicted.shape == target.shape 74 | 75 | norm_predicted = torch.mean(torch.sum(predicted**2, dim=3, keepdim=True), dim=2, keepdim=True) 76 | norm_target = torch.mean(torch.sum(target*predicted, dim=3, keepdim=True), dim=2, keepdim=True) 77 | scale = norm_target / norm_predicted 78 | return mpjpe(scale * predicted, target) 79 | 80 | def mean_velocity_error(predicted, target): 81 | """ 82 | Mean per-joint velocity error (i.e. 
mean Euclidean distance of the 1st derivative) 83 | """ 84 | assert predicted.shape == target.shape 85 | 86 | velocity_predicted = np.diff(predicted, axis=0) 87 | velocity_target = np.diff(target, axis=0) 88 | 89 | return np.mean(np.linalg.norm(velocity_predicted - velocity_target, axis=len(target.shape)-1)) -------------------------------------------------------------------------------- /common/mocap_dataset.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | import numpy as np 9 | from common.skeleton import Skeleton 10 | 11 | class MocapDataset: 12 | def __init__(self, fps, skeleton): 13 | self._skeleton = skeleton 14 | self._fps = fps 15 | self._data = None # Must be filled by subclass 16 | self._cameras = None # Must be filled by subclass 17 | 18 | def remove_joints(self, joints_to_remove): 19 | kept_joints = self._skeleton.remove_joints(joints_to_remove) 20 | for subject in self._data.keys(): 21 | for action in self._data[subject].keys(): 22 | s = self._data[subject][action] 23 | if 'positions' in s: 24 | s['positions'] = s['positions'][:, kept_joints] 25 | 26 | 27 | def __getitem__(self, key): 28 | return self._data[key] 29 | 30 | def subjects(self): 31 | return self._data.keys() 32 | 33 | def fps(self): 34 | return self._fps 35 | 36 | def skeleton(self): 37 | return self._skeleton 38 | 39 | def cameras(self): 40 | return self._cameras 41 | 42 | def supports_semi_supervised(self): 43 | # This method can be overridden 44 | return False -------------------------------------------------------------------------------- /common/model.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | import torch.nn as nn 9 | 10 | class TemporalModelBase(nn.Module): 11 | """ 12 | Do not instantiate this class. 13 | """ 14 | 15 | def __init__(self, num_joints_in, in_features, num_joints_out, 16 | filter_widths, causal, dropout, channels): 17 | super().__init__() 18 | 19 | # Validate input 20 | for fw in filter_widths: 21 | assert fw % 2 != 0, 'Only odd filter widths are supported' 22 | 23 | self.num_joints_in = num_joints_in 24 | self.in_features = in_features 25 | self.num_joints_out = num_joints_out 26 | self.filter_widths = filter_widths 27 | 28 | self.drop = nn.Dropout(dropout) 29 | self.relu = nn.ReLU(inplace=True) 30 | 31 | self.pad = [ filter_widths[0] // 2 ] 32 | self.expand_bn = nn.BatchNorm1d(channels, momentum=0.1) 33 | self.shrink = nn.Conv1d(channels, num_joints_out*3, 1) 34 | 35 | 36 | def set_bn_momentum(self, momentum): 37 | self.expand_bn.momentum = momentum 38 | for bn in self.layers_bn: 39 | bn.momentum = momentum 40 | 41 | def receptive_field(self): 42 | """ 43 | Return the total receptive field of this model as # of frames. 44 | """ 45 | frames = 0 46 | for f in self.pad: 47 | frames += f 48 | return 1 + 2*frames 49 | 50 | def total_causal_shift(self): 51 | """ 52 | Return the asymmetric offset for sequence padding. 53 | The returned value is typically 0 if causal convolutions are disabled, 54 | otherwise it is half the receptive field. 
55 | """ 56 | frames = self.causal_shift[0] 57 | next_dilation = self.filter_widths[0] 58 | for i in range(1, len(self.filter_widths)): 59 | frames += self.causal_shift[i] * next_dilation 60 | next_dilation *= self.filter_widths[i] 61 | return frames 62 | 63 | def forward(self, x): 64 | assert len(x.shape) == 4 65 | assert x.shape[-2] == self.num_joints_in 66 | assert x.shape[-1] == self.in_features 67 | 68 | sz = x.shape[:3] 69 | x = x.view(x.shape[0], x.shape[1], -1) 70 | x = x.permute(0, 2, 1) 71 | 72 | x = self._forward_blocks(x) 73 | 74 | x = x.permute(0, 2, 1) 75 | x = x.view(sz[0], -1, self.num_joints_out, 3) 76 | 77 | return x 78 | 79 | class TemporalModel(TemporalModelBase): 80 | """ 81 | Reference 3D pose estimation model with temporal convolutions. 82 | This implementation can be used for all use-cases. 83 | """ 84 | 85 | def __init__(self, num_joints_in, in_features, num_joints_out, 86 | filter_widths, causal=False, dropout=0.25, channels=1024, dense=False): 87 | """ 88 | Initialize this model. 89 | 90 | Arguments: 91 | num_joints_in -- number of input joints (e.g. 17 for Human3.6M) 92 | in_features -- number of input features for each joint (typically 2 for 2D input) 93 | num_joints_out -- number of output joints (can be different than input) 94 | filter_widths -- list of convolution widths, which also determines the # of blocks and receptive field 95 | causal -- use causal convolutions instead of symmetric convolutions (for real-time applications) 96 | dropout -- dropout probability 97 | channels -- number of convolution channels 98 | dense -- use regular dense convolutions instead of dilated convolutions (ablation experiment) 99 | """ 100 | super().__init__(num_joints_in, in_features, num_joints_out, filter_widths, causal, dropout, channels) 101 | 102 | self.expand_conv = nn.Conv1d(num_joints_in*in_features, channels, filter_widths[0], bias=False) 103 | 104 | layers_conv = [] 105 | layers_bn = [] 106 | 107 | self.causal_shift = [ (filter_widths[0]) // 2 if causal else 0 ] 108 | next_dilation = filter_widths[0] 109 | for i in range(1, len(filter_widths)): 110 | self.pad.append((filter_widths[i] - 1)*next_dilation // 2) 111 | self.causal_shift.append((filter_widths[i]//2 * next_dilation) if causal else 0) 112 | 113 | layers_conv.append(nn.Conv1d(channels, channels, 114 | filter_widths[i] if not dense else (2*self.pad[-1] + 1), 115 | dilation=next_dilation if not dense else 1, 116 | bias=False)) 117 | layers_bn.append(nn.BatchNorm1d(channels, momentum=0.1)) 118 | layers_conv.append(nn.Conv1d(channels, channels, 1, dilation=1, bias=False)) 119 | layers_bn.append(nn.BatchNorm1d(channels, momentum=0.1)) 120 | 121 | next_dilation *= filter_widths[i] 122 | 123 | self.layers_conv = nn.ModuleList(layers_conv) 124 | self.layers_bn = nn.ModuleList(layers_bn) 125 | 126 | def _forward_blocks(self, x): 127 | x = self.drop(self.relu(self.expand_bn(self.expand_conv(x)))) 128 | 129 | for i in range(len(self.pad) - 1): 130 | pad = self.pad[i+1] 131 | shift = self.causal_shift[i+1] 132 | res = x[:, :, pad + shift : x.shape[2] - pad + shift] 133 | 134 | x = self.drop(self.relu(self.layers_bn[2*i](self.layers_conv[2*i](x)))) 135 | x = res + self.drop(self.relu(self.layers_bn[2*i + 1](self.layers_conv[2*i + 1](x)))) 136 | 137 | x = self.shrink(x) 138 | return x 139 | 140 | class TemporalModelOptimized1f(TemporalModelBase): 141 | """ 142 | 3D pose estimation model optimized for single-frame batching, i.e. 143 | where batches have input length = receptive field, and output length = 1. 
144 | This scenario is only used for training when stride == 1. 145 | 146 | This implementation replaces dilated convolutions with strided convolutions 147 | to avoid generating unused intermediate results. The weights are interchangeable 148 | with the reference implementation. 149 | """ 150 | 151 | def __init__(self, num_joints_in, in_features, num_joints_out, 152 | filter_widths, causal=False, dropout=0.25, channels=1024): 153 | """ 154 | Initialize this model. 155 | 156 | Arguments: 157 | num_joints_in -- number of input joints (e.g. 17 for Human3.6M) 158 | in_features -- number of input features for each joint (typically 2 for 2D input) 159 | num_joints_out -- number of output joints (can be different than input) 160 | filter_widths -- list of convolution widths, which also determines the # of blocks and receptive field 161 | causal -- use causal convolutions instead of symmetric convolutions (for real-time applications) 162 | dropout -- dropout probability 163 | channels -- number of convolution channels 164 | """ 165 | super().__init__(num_joints_in, in_features, num_joints_out, filter_widths, causal, dropout, channels) 166 | 167 | self.expand_conv = nn.Conv1d(num_joints_in*in_features, channels, filter_widths[0], stride=filter_widths[0], bias=False) 168 | 169 | layers_conv = [] 170 | layers_bn = [] 171 | 172 | self.causal_shift = [ (filter_widths[0] // 2) if causal else 0 ] 173 | next_dilation = filter_widths[0] 174 | for i in range(1, len(filter_widths)): 175 | self.pad.append((filter_widths[i] - 1)*next_dilation // 2) 176 | self.causal_shift.append((filter_widths[i]//2) if causal else 0) 177 | 178 | layers_conv.append(nn.Conv1d(channels, channels, filter_widths[i], stride=filter_widths[i], bias=False)) 179 | layers_bn.append(nn.BatchNorm1d(channels, momentum=0.1)) 180 | layers_conv.append(nn.Conv1d(channels, channels, 1, dilation=1, bias=False)) 181 | layers_bn.append(nn.BatchNorm1d(channels, momentum=0.1)) 182 | next_dilation *= filter_widths[i] 183 | 184 | self.layers_conv = nn.ModuleList(layers_conv) 185 | self.layers_bn = nn.ModuleList(layers_bn) 186 | 187 | def _forward_blocks(self, x): 188 | x = self.drop(self.relu(self.expand_bn(self.expand_conv(x)))) 189 | 190 | for i in range(len(self.pad) - 1): 191 | res = x[:, :, self.causal_shift[i+1] + self.filter_widths[i+1]//2 :: self.filter_widths[i+1]] 192 | 193 | x = self.drop(self.relu(self.layers_bn[2*i](self.layers_conv[2*i](x)))) 194 | x = res + self.drop(self.relu(self.layers_bn[2*i + 1](self.layers_conv[2*i + 1](x)))) 195 | 196 | x = self.shrink(x) 197 | return x -------------------------------------------------------------------------------- /common/quaternion.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | import torch 9 | 10 | def qrot(q, v): 11 | """ 12 | Rotate vector(s) v about the rotation described by quaternion(s) q. 13 | Expects a tensor of shape (*, 4) for q and a tensor of shape (*, 3) for v, 14 | where * denotes any number of dimensions. 15 | Returns a tensor of shape (*, 3). 
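For example, rotating v = (1, 0, 0) by q = (0.7071, 0, 0, 0.7071), i.e. a 90-degree rotation about the z-axis with q in (w, x, y, z) order, yields approximately (0, 1, 0).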
16 | """ 17 | assert q.shape[-1] == 4 18 | assert v.shape[-1] == 3 19 | assert q.shape[:-1] == v.shape[:-1] 20 | 21 | qvec = q[..., 1:] 22 | uv = torch.cross(qvec, v, dim=len(q.shape)-1) 23 | uuv = torch.cross(qvec, uv, dim=len(q.shape)-1) 24 | return (v + 2 * (q[..., :1] * uv + uuv)) 25 | 26 | 27 | def qinverse(q, inplace=False): 28 | # We assume the quaternion to be normalized 29 | if inplace: 30 | q[..., 1:] *= -1 31 | return q 32 | else: 33 | w = q[..., :1] 34 | xyz = q[..., 1:] 35 | return torch.cat((w, -xyz), dim=len(q.shape)-1) -------------------------------------------------------------------------------- /common/skeleton.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | import numpy as np 9 | 10 | class Skeleton: 11 | def __init__(self, parents, joints_left, joints_right): 12 | assert len(joints_left) == len(joints_right) 13 | 14 | self._parents = np.array(parents) 15 | self._joints_left = joints_left 16 | self._joints_right = joints_right 17 | self._compute_metadata() 18 | 19 | def num_joints(self): 20 | return len(self._parents) 21 | 22 | def parents(self): 23 | return self._parents 24 | 25 | def has_children(self): 26 | return self._has_children 27 | 28 | def children(self): 29 | return self._children 30 | 31 | def remove_joints(self, joints_to_remove): 32 | """ 33 | Remove the joints specified in 'joints_to_remove'. 34 | """ 35 | valid_joints = [] 36 | for joint in range(len(self._parents)): 37 | if joint not in joints_to_remove: 38 | valid_joints.append(joint) 39 | 40 | for i in range(len(self._parents)): 41 | while self._parents[i] in joints_to_remove: 42 | self._parents[i] = self._parents[self._parents[i]] 43 | 44 | index_offsets = np.zeros(len(self._parents), dtype=int) 45 | new_parents = [] 46 | for i, parent in enumerate(self._parents): 47 | if i not in joints_to_remove: 48 | new_parents.append(parent - index_offsets[parent]) 49 | else: 50 | index_offsets[i:] += 1 51 | self._parents = np.array(new_parents) 52 | 53 | 54 | if self._joints_left is not None: 55 | new_joints_left = [] 56 | for joint in self._joints_left: 57 | if joint in valid_joints: 58 | new_joints_left.append(joint - index_offsets[joint]) 59 | self._joints_left = new_joints_left 60 | if self._joints_right is not None: 61 | new_joints_right = [] 62 | for joint in self._joints_right: 63 | if joint in valid_joints: 64 | new_joints_right.append(joint - index_offsets[joint]) 65 | self._joints_right = new_joints_right 66 | 67 | self._compute_metadata() 68 | 69 | return valid_joints 70 | 71 | def joints_left(self): 72 | return self._joints_left 73 | 74 | def joints_right(self): 75 | return self._joints_right 76 | 77 | def _compute_metadata(self): 78 | self._has_children = np.zeros(len(self._parents)).astype(bool) 79 | for i, parent in enumerate(self._parents): 80 | if parent != -1: 81 | self._has_children[parent] = True 82 | 83 | self._children = [] 84 | for i, parent in enumerate(self._parents): 85 | self._children.append([]) 86 | for i, parent in enumerate(self._parents): 87 | if parent != -1: 88 | self._children[parent].append(i) -------------------------------------------------------------------------------- /common/utils.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 
2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | import torch 9 | import numpy as np 10 | import hashlib 11 | 12 | def wrap(func, *args, unsqueeze=False): 13 | """ 14 | Wrap a torch function so it can be called with NumPy arrays. 15 | Input and return types are seamlessly converted. 16 | """ 17 | 18 | # Convert input types where applicable 19 | args = list(args) 20 | for i, arg in enumerate(args): 21 | if type(arg) == np.ndarray: 22 | args[i] = torch.from_numpy(arg) 23 | if unsqueeze: 24 | args[i] = args[i].unsqueeze(0) 25 | 26 | result = func(*args) 27 | 28 | # Convert output types where applicable 29 | if isinstance(result, tuple): 30 | result = list(result) 31 | for i, res in enumerate(result): 32 | if type(res) == torch.Tensor: 33 | if unsqueeze: 34 | res = res.squeeze(0) 35 | result[i] = res.numpy() 36 | return tuple(result) 37 | elif type(result) == torch.Tensor: 38 | if unsqueeze: 39 | result = result.squeeze(0) 40 | return result.numpy() 41 | else: 42 | return result 43 | 44 | def deterministic_random(min_value, max_value, data): 45 | digest = hashlib.sha256(data.encode()).digest() 46 | raw_value = int.from_bytes(digest[:4], byteorder='little', signed=False) 47 | return int(raw_value / (2**32 - 1) * (max_value - min_value)) + min_value -------------------------------------------------------------------------------- /common/visualization.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 
6 | # 7 | 8 | import matplotlib 9 | matplotlib.use('Agg') 10 | 11 | import matplotlib.pyplot as plt 12 | from matplotlib.animation import FuncAnimation, writers 13 | from mpl_toolkits.mplot3d import Axes3D 14 | import numpy as np 15 | import subprocess as sp 16 | 17 | def get_resolution(filename): 18 | command = ['ffprobe', '-v', 'error', '-select_streams', 'v:0', 19 | '-show_entries', 'stream=width,height', '-of', 'csv=p=0', filename] 20 | with sp.Popen(command, stdout=sp.PIPE, bufsize=-1) as pipe: 21 | for line in pipe.stdout: 22 | w, h = line.decode().strip().split(',') 23 | return int(w), int(h) 24 | 25 | def get_fps(filename): 26 | command = ['ffprobe', '-v', 'error', '-select_streams', 'v:0', 27 | '-show_entries', 'stream=r_frame_rate', '-of', 'csv=p=0', filename] 28 | with sp.Popen(command, stdout=sp.PIPE, bufsize=-1) as pipe: 29 | for line in pipe.stdout: 30 | a, b = line.decode().strip().split('/') 31 | return int(a) / int(b) 32 | 33 | def read_video(filename, skip=0, limit=-1): 34 | w, h = get_resolution(filename) 35 | 36 | command = ['ffmpeg', 37 | '-i', filename, 38 | '-f', 'image2pipe', 39 | '-pix_fmt', 'rgb24', 40 | '-vsync', '0', 41 | '-vcodec', 'rawvideo', '-'] 42 | 43 | i = 0 44 | with sp.Popen(command, stdout = sp.PIPE, bufsize=-1) as pipe: 45 | while True: 46 | data = pipe.stdout.read(w*h*3) 47 | if not data: 48 | break 49 | i += 1 50 | if i > limit and limit != -1: 51 | continue 52 | if i > skip: 53 | yield np.frombuffer(data, dtype='uint8').reshape((h, w, 3)) 54 | 55 | 56 | 57 | 58 | def downsample_tensor(X, factor): 59 | length = X.shape[0]//factor * factor 60 | return np.mean(X[:length].reshape(-1, factor, *X.shape[1:]), axis=1) 61 | 62 | def render_animation(keypoints, keypoints_metadata, poses, skeleton, fps, bitrate, azim, output, viewport, 63 | limit=-1, downsample=1, size=6, input_video_path=None, input_video_skip=0): 64 | """ 65 | TODO 66 | Render an animation. The supported output modes are: 67 | -- 'interactive': display an interactive figure 68 | (also works on notebooks if associated with %matplotlib inline) 69 | -- 'html': render the animation as HTML5 video. Can be displayed in a notebook using HTML(...). 70 | -- 'filename.mp4': render and export the animation as an h264 video (requires ffmpeg). 71 | -- 'filename.gif': render and export the animation a gif file (requires imagemagick). 
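    Main arguments (as inferred from the usage below; a descriptive sketch rather than a formal API contract):
     -- keypoints: (num_frames, num_joints, 2) array of 2D keypoints overlaid on the input frames.
     -- keypoints_metadata: dict providing 'layout_name' and 'keypoints_symmetry' for the 2D layout.
     -- poses: dict mapping a subplot title to a (num_frames, num_joints, 3) array of 3D poses.
     -- skeleton: Skeleton instance supplying joint parents and the right-side joint indices.
     -- fps: output frame rate; if None, it is read from the input video via ffprobe.
     -- azim: azimuth angle of the 3D views; viewport: (width, height) used when no input video is given.
     -- downsample: temporal downsampling factor applied to frames, keypoints, and poses.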
72 | """ 73 | plt.ioff() 74 | fig = plt.figure(figsize=(size*(1 + len(poses)), size)) 75 | ax_in = fig.add_subplot(1, 1 + len(poses), 1) 76 | ax_in.get_xaxis().set_visible(False) 77 | ax_in.get_yaxis().set_visible(False) 78 | ax_in.set_axis_off() 79 | ax_in.set_title('Input') 80 | 81 | ax_3d = [] 82 | lines_3d = [] 83 | trajectories = [] 84 | radius = 1.7 85 | for index, (title, data) in enumerate(poses.items()): 86 | ax = fig.add_subplot(1, 1 + len(poses), index+2, projection='3d') 87 | ax.view_init(elev=15., azim=azim) 88 | ax.set_xlim3d([-radius/2, radius/2]) 89 | ax.set_zlim3d([0, radius]) 90 | ax.set_ylim3d([-radius/2, radius/2]) 91 | try: 92 | ax.set_aspect('equal') 93 | except NotImplementedError: 94 | ax.set_aspect('auto') 95 | ax.set_xticklabels([]) 96 | ax.set_yticklabels([]) 97 | ax.set_zticklabels([]) 98 | ax.dist = 7.5 99 | ax.set_title(title) #, pad=35 100 | ax_3d.append(ax) 101 | lines_3d.append([]) 102 | trajectories.append(data[:, 0, [0, 1]]) 103 | poses = list(poses.values()) 104 | 105 | # Decode video 106 | if input_video_path is None: 107 | # Black background 108 | all_frames = np.zeros((keypoints.shape[0], viewport[1], viewport[0]), dtype='uint8') 109 | else: 110 | # Load video using ffmpeg 111 | all_frames = [] 112 | for f in read_video(input_video_path, skip=input_video_skip, limit=limit): 113 | all_frames.append(f) 114 | effective_length = min(keypoints.shape[0], len(all_frames)) 115 | all_frames = all_frames[:effective_length] 116 | 117 | keypoints = keypoints[input_video_skip:] # todo remove 118 | for idx in range(len(poses)): 119 | poses[idx] = poses[idx][input_video_skip:] 120 | 121 | if fps is None: 122 | fps = get_fps(input_video_path) 123 | 124 | if downsample > 1: 125 | keypoints = downsample_tensor(keypoints, downsample) 126 | all_frames = downsample_tensor(np.array(all_frames), downsample).astype('uint8') 127 | for idx in range(len(poses)): 128 | poses[idx] = downsample_tensor(poses[idx], downsample) 129 | trajectories[idx] = downsample_tensor(trajectories[idx], downsample) 130 | fps /= downsample 131 | 132 | initialized = False 133 | image = None 134 | lines = [] 135 | points = None 136 | 137 | if limit < 1: 138 | limit = len(all_frames) 139 | else: 140 | limit = min(limit, len(all_frames)) 141 | 142 | parents = skeleton.parents() 143 | def update_video(i): 144 | nonlocal initialized, image, lines, points 145 | 146 | for n, ax in enumerate(ax_3d): 147 | ax.set_xlim3d([-radius/2 + trajectories[n][i, 0], radius/2 + trajectories[n][i, 0]]) 148 | ax.set_ylim3d([-radius/2 + trajectories[n][i, 1], radius/2 + trajectories[n][i, 1]]) 149 | 150 | # Update 2D poses 151 | joints_right_2d = keypoints_metadata['keypoints_symmetry'][1] 152 | colors_2d = np.full(keypoints.shape[1], 'black') 153 | colors_2d[joints_right_2d] = 'red' 154 | if not initialized: 155 | image = ax_in.imshow(all_frames[i], aspect='equal') 156 | 157 | for j, j_parent in enumerate(parents): 158 | if j_parent == -1: 159 | continue 160 | 161 | if len(parents) == keypoints.shape[1] and keypoints_metadata['layout_name'] != 'coco': 162 | # Draw skeleton only if keypoints match (otherwise we don't have the parents definition) 163 | lines.append(ax_in.plot([keypoints[i, j, 0], keypoints[i, j_parent, 0]], 164 | [keypoints[i, j, 1], keypoints[i, j_parent, 1]], color='pink')) 165 | 166 | col = 'red' if j in skeleton.joints_right() else 'black' 167 | for n, ax in enumerate(ax_3d): 168 | pos = poses[n][i] 169 | lines_3d[n].append(ax.plot([pos[j, 0], pos[j_parent, 0]], 170 | [pos[j, 1], pos[j_parent, 1]], 171 
| [pos[j, 2], pos[j_parent, 2]], zdir='z', c=col)) 172 | 173 | points = ax_in.scatter(*keypoints[i].T, 10, color=colors_2d, edgecolors='white', zorder=10) 174 | 175 | initialized = True 176 | else: 177 | image.set_data(all_frames[i]) 178 | 179 | for j, j_parent in enumerate(parents): 180 | if j_parent == -1: 181 | continue 182 | 183 | if len(parents) == keypoints.shape[1] and keypoints_metadata['layout_name'] != 'coco': 184 | lines[j-1][0].set_data([keypoints[i, j, 0], keypoints[i, j_parent, 0]], 185 | [keypoints[i, j, 1], keypoints[i, j_parent, 1]]) 186 | 187 | for n, ax in enumerate(ax_3d): 188 | pos = poses[n][i] 189 | lines_3d[n][j-1][0].set_xdata(np.array([pos[j, 0], pos[j_parent, 0]])) 190 | lines_3d[n][j-1][0].set_ydata(np.array([pos[j, 1], pos[j_parent, 1]])) 191 | lines_3d[n][j-1][0].set_3d_properties(np.array([pos[j, 2], pos[j_parent, 2]]), zdir='z') 192 | 193 | points.set_offsets(keypoints[i]) 194 | 195 | print('{}/{} '.format(i, limit), end='\r') 196 | 197 | 198 | fig.tight_layout() 199 | 200 | anim = FuncAnimation(fig, update_video, frames=np.arange(0, limit), interval=1000/fps, repeat=False) 201 | if output.endswith('.mp4'): 202 | Writer = writers['ffmpeg'] 203 | writer = Writer(fps=fps, metadata={}, bitrate=bitrate) 204 | anim.save(output, writer=writer) 205 | elif output.endswith('.gif'): 206 | anim.save(output, dpi=80, writer='imagemagick') 207 | else: 208 | raise ValueError('Unsupported output format (only .mp4 and .gif are supported)') 209 | plt.close() -------------------------------------------------------------------------------- /data/ConvertHumanEva.m: -------------------------------------------------------------------------------- 1 | % Copyright (c) 2018-present, Facebook, Inc. 2 | % All rights reserved. 3 | % 4 | % This source code is licensed under the license found in the 5 | % LICENSE file in the root directory of this source tree. 6 | % 7 | 8 | function [] = ConvertDataset() 9 | 10 | N_JOINTS = 15; % Set to 20 if you want to export a 20-joint skeleton 11 | 12 | function [pose_out] = ExtractPose15(pose, dimensions) 13 | % We use the same 15-joint skeleton as in the evaluation 14 | % script "@body_pose/error.m". Proximal and Distal joints 15 | % are averaged. 
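        % Zero-indexed, this ordering corresponds to the 'humaneva15' layout in data/data_utils.py:
        % left-side joints are 2-4 (shoulder, elbow, wrist) and 8-10 (hip, knee, ankle),
        % right-side joints are 5-7 and 11-13.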
16 | pose_out = NaN(15, dimensions); 17 | pose_out(1, :) = pose.torsoDistal; % Pelvis (root) 18 | pose_out(2, :) = (pose.torsoProximal + pose.headProximal) / 2; % Thorax 19 | pose_out(3, :) = pose.upperLArmProximal; % Left shoulder 20 | pose_out(4, :) = (pose.upperLArmDistal + pose.lowerLArmProximal) / 2; % Left elbow 21 | pose_out(5, :) = pose.lowerLArmDistal; % Left wrist 22 | pose_out(6, :) = pose.upperRArmProximal; % Right shoulder 23 | pose_out(7, :) = (pose.upperRArmDistal + pose.lowerRArmProximal) / 2; % Right elbow 24 | pose_out(8, :) = pose.lowerRArmDistal; % Right wrist 25 | pose_out(9, :) = pose.upperLLegProximal; % Left hip 26 | pose_out(10, :) = (pose.upperLLegDistal + pose.lowerLLegProximal) / 2; % Left knee 27 | pose_out(11, :) = pose.lowerLLegDistal; % Left ankle 28 | pose_out(12, :) = pose.upperRLegProximal; % Right hip 29 | pose_out(13, :) = (pose.upperRLegDistal + pose.lowerRLegProximal) / 2; % Right knee 30 | pose_out(14, :) = pose.lowerRLegDistal; % Right ankle 31 | pose_out(15, :) = pose.headDistal; % Head 32 | end 33 | 34 | function [pose_out] = ExtractPose20(pose, dimensions) 35 | pose_out = NaN(20, dimensions); 36 | pose_out(1, :) = pose.torsoDistal; % Pelvis (root) 37 | pose_out(2, :) = pose.torsoProximal; 38 | pose_out(3, :) = pose.headProximal; 39 | pose_out(4, :) = pose.upperLArmProximal; % Left shoulder 40 | pose_out(5, :) = pose.upperLArmDistal; 41 | pose_out(6, :) = pose.lowerLArmProximal; 42 | pose_out(7, :) = pose.lowerLArmDistal; % Left wrist 43 | pose_out(8, :) = pose.upperRArmProximal; % Right shoulder 44 | pose_out(9, :) = pose.upperRArmDistal; 45 | pose_out(10, :) = pose.lowerRArmProximal; 46 | pose_out(11, :) = pose.lowerRArmDistal; % Right wrist 47 | pose_out(12, :) = pose.upperLLegProximal; % Left hip 48 | pose_out(13, :) = pose.upperLLegDistal; 49 | pose_out(14, :) = pose.lowerLLegProximal; 50 | pose_out(15, :) = pose.lowerLLegDistal; % Left ankle 51 | pose_out(16, :) = pose.upperRLegProximal; % Right hip 52 | pose_out(17, :) = pose.upperRLegDistal; 53 | pose_out(18, :) = pose.lowerRLegProximal; 54 | pose_out(19, :) = pose.lowerRLegDistal; % Right ankle 55 | pose_out(20, :) = pose.headDistal; % Head 56 | end 57 | 58 | addpath('./TOOLBOX_calib/'); 59 | addpath('./TOOLBOX_common/'); 60 | addpath('./TOOLBOX_dxAvi/'); 61 | addpath('./TOOLBOX_readc3d/'); 62 | 63 | % Create the output directory for the converted dataset 64 | OUT_DIR = ['./converted_', int2str(N_JOINTS), 'j']; 65 | warning('off', 'MATLAB:MKDIR:DirectoryExists'); 66 | mkdir(OUT_DIR); 67 | 68 | % We use the validation set as the test set 69 | for SPLIT = {'Train', 'Validate'} 70 | mkdir([OUT_DIR, '/', SPLIT{1}]); 71 | CurrentDataset = he_dataset('HumanEvaI', SPLIT{1}); 72 | 73 | for SEQ = 1:length(CurrentDataset) 74 | 75 | Subject = char(get(CurrentDataset(SEQ), 'SubjectName')); 76 | Action = char(get(CurrentDataset(SEQ), 'ActionType')); 77 | Trial = char(get(CurrentDataset(SEQ), 'Trial')); 78 | DatasetBasePath = char(get(CurrentDataset(SEQ), 'DatasetBasePath')); 79 | if Trial ~= '1' 80 | % We are only interested in fully-annotated data 81 | continue; 82 | end 83 | 84 | if strcmp(Action, 'ThrowCatch') && strcmp(Subject, 'S3') 85 | % Damaged mocap stream 86 | continue; 87 | end 88 | 89 | fprintf('Converting...\n') 90 | fprintf('\tSplit: %s\n', SPLIT{1}); 91 | fprintf('\tSubject: %s\n', Subject); 92 | fprintf('\tAction: %s\n', Action); 93 | fprintf('\tTrial: %s\n', Trial); 94 | 95 | % Create subject directory if it does not exist 96 | mkdir([OUT_DIR, '/', SPLIT{1}, '/', Subject]); 97 | 98 | 
% Load the sequence 99 | [~, ~, MocapStream, MocapStream_Enabled] ... 100 | = sync_stream(CurrentDataset(SEQ)); 101 | 102 | % Set frame range 103 | FrameStart = get(CurrentDataset(SEQ), 'FrameStart'); 104 | FrameStart = [FrameStart{:}]; 105 | FrameEnd = get(CurrentDataset(SEQ), 'FrameEnd'); 106 | FrameEnd = [FrameEnd{:}]; 107 | 108 | fprintf('\tNum. frames: %d\n', FrameEnd - FrameStart + 1); 109 | poses_3d = NaN(FrameEnd - FrameStart + 1, N_JOINTS, 3); 110 | poses_2d = NaN(3, FrameEnd - FrameStart + 1, N_JOINTS, 2); 111 | corrupt = 0; 112 | for FRAME = FrameStart:FrameEnd 113 | 114 | if (MocapStream_Enabled) 115 | [MocapStream, pose, ValidPose] = cur_frame(MocapStream, FRAME, 'body_pose'); 116 | 117 | if (ValidPose) 118 | i = FRAME - FrameStart + 1; 119 | 120 | % Extract 3D pose 121 | if N_JOINTS == 15 122 | poses_3d(i, :, :) = ExtractPose15(pose, 3); 123 | else 124 | poses_3d(i, :, :) = ExtractPose20(pose, 3); 125 | end 126 | 127 | % Extract ground-truth 2D pose via camera 128 | % projection 129 | for CAM = 1:3 130 | if (CAM == 1) 131 | CameraName = 'C1'; 132 | elseif (CAM == 2) 133 | CameraName = 'C2'; 134 | elseif (CAM == 3) 135 | CameraName = 'C3'; 136 | end 137 | CalibrationFilename = [DatasetBasePath, Subject, '/Calibration_Data/', CameraName, '.cal']; 138 | pose_2d = project2d(pose, CalibrationFilename); 139 | if N_JOINTS == 15 140 | poses_2d(CAM, i, :, :) = ExtractPose15(pose_2d, 2); 141 | else 142 | poses_2d(CAM, i, :, :) = ExtractPose20(pose_2d, 2); 143 | end 144 | end 145 | 146 | else 147 | corrupt = corrupt + 1; 148 | end 149 | end 150 | end 151 | fprintf('\n%d out of %d frames are damaged\n', corrupt, FrameEnd - FrameStart + 1); 152 | FileName = [OUT_DIR, '/', SPLIT{1}, '/', Subject, '/', Action, '_', Trial, '.mat']; 153 | save(FileName, 'poses_3d', 'poses_2d'); 154 | fprintf('... saved to %s\n\n', FileName); 155 | end 156 | end 157 | end 158 | -------------------------------------------------------------------------------- /data/convert_cdf_to_mat.m: -------------------------------------------------------------------------------- 1 | % Copyright (c) 2018-present, Facebook, Inc. 2 | % All rights reserved. 3 | % 4 | % This source code is licensed under the license found in the 5 | % LICENSE file in the root directory of this source tree. 6 | % 7 | 8 | % Extract "Poses_D3_Positions_S*.tgz" to the "pose" directory 9 | % and run this script to convert all .cdf files to .mat 10 | 11 | pose_directory = 'pose'; 12 | dirs = dir(strcat(pose_directory, '/*/MyPoseFeatures/D3_Positions/*.cdf')); 13 | 14 | paths = {dirs.folder}; 15 | names = {dirs.name}; 16 | 17 | for i = 1:numel(names) 18 | data = cdfread(strcat(paths{i}, '/', names{i})); 19 | save(strcat(paths{i}, '/', names{i}, '.mat'), 'data'); 20 | end -------------------------------------------------------------------------------- /data/data_utils.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 
6 | # 7 | 8 | import numpy as np 9 | 10 | mpii_metadata = { 11 | 'layout_name': 'mpii', 12 | 'num_joints': 16, 13 | 'keypoints_symmetry': [ 14 | [3, 4, 5, 13, 14, 15], 15 | [0, 1, 2, 10, 11, 12], 16 | ] 17 | } 18 | 19 | coco_metadata = { 20 | 'layout_name': 'coco', 21 | 'num_joints': 17, 22 | 'keypoints_symmetry': [ 23 | [1, 3, 5, 7, 9, 11, 13, 15], 24 | [2, 4, 6, 8, 10, 12, 14, 16], 25 | ] 26 | } 27 | 28 | h36m_metadata = { 29 | 'layout_name': 'h36m', 30 | 'num_joints': 17, 31 | 'keypoints_symmetry': [ 32 | [4, 5, 6, 11, 12, 13], 33 | [1, 2, 3, 14, 15, 16], 34 | ] 35 | } 36 | 37 | humaneva15_metadata = { 38 | 'layout_name': 'humaneva15', 39 | 'num_joints': 15, 40 | 'keypoints_symmetry': [ 41 | [2, 3, 4, 8, 9, 10], 42 | [5, 6, 7, 11, 12, 13] 43 | ] 44 | } 45 | 46 | humaneva20_metadata = { 47 | 'layout_name': 'humaneva20', 48 | 'num_joints': 20, 49 | 'keypoints_symmetry': [ 50 | [3, 4, 5, 6, 11, 12, 13, 14], 51 | [7, 8, 9, 10, 15, 16, 17, 18] 52 | ] 53 | } 54 | 55 | def suggest_metadata(name): 56 | names = [] 57 | for metadata in [mpii_metadata, coco_metadata, h36m_metadata, humaneva15_metadata, humaneva20_metadata]: 58 | if metadata['layout_name'] in name: 59 | return metadata 60 | names.append(metadata['layout_name']) 61 | raise KeyError('Cannot infer keypoint layout from name "{}". Tried {}.'.format(name, names)) 62 | 63 | def import_detectron_poses(path): 64 | # Latin1 encoding because Detectron runs on Python 2.7 65 | data = np.load(path, encoding='latin1') 66 | kp = data['keypoints'] 67 | bb = data['boxes'] 68 | results = [] 69 | for i in range(len(bb)): 70 | if len(bb[i][1]) == 0: 71 | assert i > 0 72 | # Use last pose in case of detection failure 73 | results.append(results[-1]) 74 | continue 75 | best_match = np.argmax(bb[i][1][:, 4]) 76 | keypoints = kp[i][1][best_match].T.copy() 77 | results.append(keypoints) 78 | results = np.array(results) 79 | return results[:, :, 4:6] # Soft-argmax 80 | #return results[:, :, [0, 1, 3]] # Argmax + score 81 | 82 | 83 | def import_cpn_poses(path): 84 | data = np.load(path) 85 | kp = data['keypoints'] 86 | return kp[:, :, :2] 87 | 88 | 89 | def import_sh_poses(path): 90 | import h5py 91 | with h5py.File(path) as hf: 92 | positions = hf['poses'].value 93 | return positions.astype('float32') 94 | 95 | def suggest_pose_importer(name): 96 | if 'detectron' in name: 97 | return import_detectron_poses 98 | if 'cpn' in name: 99 | return import_cpn_poses 100 | if 'sh' in name: 101 | return import_sh_poses 102 | raise KeyError('Cannot infer keypoint format from name "{}". Tried detectron, cpn, sh.'.format(name)) 103 | -------------------------------------------------------------------------------- /data/prepare_data_2d_custom.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 
6 | # 7 | 8 | import numpy as np 9 | from glob import glob 10 | import os 11 | import sys 12 | 13 | import argparse 14 | from data_utils import suggest_metadata 15 | 16 | output_prefix_2d = 'data_2d_custom_' 17 | 18 | def decode(filename): 19 | # Latin1 encoding because Detectron runs on Python 2.7 20 | print('Processing {}'.format(filename)) 21 | data = np.load(filename, encoding='latin1', allow_pickle=True) 22 | bb = data['boxes'] 23 | kp = data['keypoints'] 24 | metadata = data['metadata'].item() 25 | results_bb = [] 26 | results_kp = [] 27 | for i in range(len(bb)): 28 | if len(bb[i][1]) == 0 or len(kp[i][1]) == 0: 29 | # No bbox/keypoints detected for this frame -> will be interpolated 30 | results_bb.append(np.full(4, np.nan, dtype=np.float32)) # 4 bounding box coordinates 31 | results_kp.append(np.full((17, 4), np.nan, dtype=np.float32)) # 17 COCO keypoints 32 | continue 33 | best_match = np.argmax(bb[i][1][:, 4]) 34 | best_bb = bb[i][1][best_match, :4] 35 | best_kp = kp[i][1][best_match].T.copy() 36 | results_bb.append(best_bb) 37 | results_kp.append(best_kp) 38 | 39 | bb = np.array(results_bb, dtype=np.float32) 40 | kp = np.array(results_kp, dtype=np.float32) 41 | kp = kp[:, :, :2] # Extract (x, y) 42 | 43 | # Fix missing bboxes/keypoints by linear interpolation 44 | mask = ~np.isnan(bb[:, 0]) 45 | indices = np.arange(len(bb)) 46 | for i in range(4): 47 | bb[:, i] = np.interp(indices, indices[mask], bb[mask, i]) 48 | for i in range(17): 49 | for j in range(2): 50 | kp[:, i, j] = np.interp(indices, indices[mask], kp[mask, i, j]) 51 | 52 | print('{} total frames processed'.format(len(bb))) 53 | print('{} frames were interpolated'.format(np.sum(~mask))) 54 | print('----------') 55 | 56 | return [{ 57 | 'start_frame': 0, # Inclusive 58 | 'end_frame': len(kp), # Exclusive 59 | 'bounding_boxes': bb, 60 | 'keypoints': kp, 61 | }], metadata 62 | 63 | 64 | if __name__ == '__main__': 65 | if os.path.basename(os.getcwd()) != 'data': 66 | print('This script must be launched from the "data" directory') 67 | exit(0) 68 | 69 | parser = argparse.ArgumentParser(description='Custom dataset creator') 70 | parser.add_argument('-i', '--input', type=str, default='', metavar='PATH', help='detections directory') 71 | parser.add_argument('-o', '--output', type=str, default='', metavar='PATH', help='output suffix for 2D detections') 72 | args = parser.parse_args() 73 | 74 | if not args.input: 75 | print('Please specify the input directory') 76 | exit(0) 77 | 78 | if not args.output: 79 | print('Please specify an output suffix (e.g. detectron_pt_coco)') 80 | exit(0) 81 | 82 | print('Parsing 2D detections from', args.input) 83 | 84 | metadata = suggest_metadata('coco') 85 | metadata['video_metadata'] = {} 86 | 87 | output = {} 88 | file_list = glob(args.input + '/*.npz') 89 | for f in file_list: 90 | canonical_name = os.path.splitext(os.path.basename(f))[0] 91 | data, video_metadata = decode(f) 92 | output[canonical_name] = {} 93 | output[canonical_name]['custom'] = [data[0]['keypoints'].astype('float32')] 94 | metadata['video_metadata'][canonical_name] = video_metadata 95 | 96 | print('Saving...') 97 | np.savez_compressed(output_prefix_2d + args.output, positions_2d=output, metadata=metadata) 98 | print('Done.') -------------------------------------------------------------------------------- /data/prepare_data_2d_h36m_generic.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 
3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | import argparse 9 | import os 10 | import zipfile 11 | import numpy as np 12 | import h5py 13 | import re 14 | from glob import glob 15 | from shutil import rmtree 16 | from data_utils import suggest_metadata, suggest_pose_importer 17 | 18 | import sys 19 | sys.path.append('../') 20 | from common.utils import wrap 21 | from itertools import groupby 22 | 23 | output_prefix_2d = 'data_2d_h36m_' 24 | cam_map = { 25 | '54138969': 0, 26 | '55011271': 1, 27 | '58860488': 2, 28 | '60457274': 3, 29 | } 30 | 31 | if __name__ == '__main__': 32 | if os.path.basename(os.getcwd()) != 'data': 33 | print('This script must be launched from the "data" directory') 34 | exit(0) 35 | 36 | parser = argparse.ArgumentParser(description='Human3.6M dataset converter') 37 | 38 | parser.add_argument('-i', '--input', default='', type=str, metavar='PATH', help='input path to 2D detections') 39 | parser.add_argument('-o', '--output', default='', type=str, metavar='PATH', help='output suffix for 2D detections (e.g. detectron_pt_coco)') 40 | 41 | args = parser.parse_args() 42 | 43 | if not args.input: 44 | print('Please specify the input directory') 45 | exit(0) 46 | 47 | if not args.output: 48 | print('Please specify an output suffix (e.g. detectron_pt_coco)') 49 | exit(0) 50 | 51 | import_func = suggest_pose_importer(args.output) 52 | metadata = suggest_metadata(args.output) 53 | 54 | print('Parsing 2D detections from', args.input) 55 | 56 | output = {} 57 | file_list = glob(args.input + '/S*/*.mp4.npz') 58 | for f in file_list: 59 | path, fname = os.path.split(f) 60 | subject = os.path.basename(path) 61 | assert subject.startswith('S'), subject + ' does not look like a subject directory' 62 | 63 | if '_ALL' in fname: 64 | continue 65 | 66 | m = re.search('(.*)\\.([0-9]+)\\.mp4\\.npz', fname) 67 | action = m.group(1) 68 | camera = m.group(2) 69 | camera_idx = cam_map[camera] 70 | 71 | if subject == 'S11' and action == 'Directions': 72 | continue # Discard corrupted video 73 | 74 | # Use consistent naming convention 75 | canonical_name = action.replace('TakingPhoto', 'Photo') \ 76 | .replace('WalkingDog', 'WalkDog') 77 | 78 | keypoints = import_func(f) 79 | assert keypoints.shape[1] == metadata['num_joints'] 80 | 81 | if subject not in output: 82 | output[subject] = {} 83 | if canonical_name not in output[subject]: 84 | output[subject][canonical_name] = [None, None, None, None] 85 | output[subject][canonical_name][camera_idx] = keypoints.astype('float32') 86 | 87 | print('Saving...') 88 | np.savez_compressed(output_prefix_2d + args.output, positions_2d=output, metadata=metadata) 89 | print('Done.') -------------------------------------------------------------------------------- /data/prepare_data_2d_h36m_sh.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 
6 | # 7 | 8 | import argparse 9 | import os 10 | import zipfile 11 | import tarfile 12 | import numpy as np 13 | import h5py 14 | from glob import glob 15 | from shutil import rmtree 16 | 17 | import sys 18 | sys.path.append('../') 19 | from common.h36m_dataset import Human36mDataset 20 | from common.camera import world_to_camera, project_to_2d, image_coordinates 21 | from common.utils import wrap 22 | 23 | output_filename_pt = 'data_2d_h36m_sh_pt_mpii' 24 | output_filename_ft = 'data_2d_h36m_sh_ft_h36m' 25 | subjects = ['S1', 'S5', 'S6', 'S7', 'S8', 'S9', 'S11'] 26 | cam_map = { 27 | '54138969': 0, 28 | '55011271': 1, 29 | '58860488': 2, 30 | '60457274': 3, 31 | } 32 | 33 | metadata = { 34 | 'num_joints': 16, 35 | 'keypoints_symmetry': [ 36 | [3, 4, 5, 13, 14, 15], 37 | [0, 1, 2, 10, 11, 12], 38 | ] 39 | } 40 | 41 | def process_subject(subject, file_list, output): 42 | if subject == 'S11': 43 | assert len(file_list) == 119, "Expected 119 files for subject " + subject + ", got " + str(len(file_list)) 44 | else: 45 | assert len(file_list) == 120, "Expected 120 files for subject " + subject + ", got " + str(len(file_list)) 46 | 47 | for f in file_list: 48 | action, cam = os.path.splitext(os.path.basename(f))[0].replace('_', ' ').split('.') 49 | 50 | if subject == 'S11' and action == 'Directions': 51 | continue # Discard corrupted video 52 | 53 | if action not in output[subject]: 54 | output[subject][action] = [None, None, None, None] 55 | 56 | with h5py.File(f) as hf: 57 | positions = hf['poses'].value 58 | output[subject][action][cam_map[cam]] = positions.astype('float32') 59 | 60 | if __name__ == '__main__': 61 | if os.path.basename(os.getcwd()) != 'data': 62 | print('This script must be launched from the "data" directory') 63 | exit(0) 64 | 65 | parser = argparse.ArgumentParser(description='Human3.6M dataset downloader/converter') 66 | 67 | parser.add_argument('-pt', '--pretrained', default='', type=str, metavar='PATH', help='convert pretrained dataset') 68 | parser.add_argument('-ft', '--fine-tuned', default='', type=str, metavar='PATH', help='convert fine-tuned dataset') 69 | 70 | args = parser.parse_args() 71 | 72 | if args.pretrained: 73 | print('Converting pretrained dataset from', args.pretrained) 74 | print('Extracting...') 75 | with zipfile.ZipFile(args.pretrained, 'r') as archive: 76 | archive.extractall('sh_pt') 77 | 78 | print('Converting...') 79 | output = {} 80 | for subject in subjects: 81 | output[subject] = {} 82 | file_list = glob('sh_pt/h36m/' + subject + '/StackedHourglass/*.h5') 83 | process_subject(subject, file_list, output) 84 | 85 | print('Saving...') 86 | np.savez_compressed(output_filename_pt, positions_2d=output, metadata=metadata) 87 | 88 | print('Cleaning up...') 89 | rmtree('sh_pt') 90 | 91 | print('Done.') 92 | 93 | if args.fine_tuned: 94 | print('Converting fine-tuned dataset from', args.fine_tuned) 95 | print('Extracting...') 96 | with tarfile.open(args.fine_tuned, 'r:gz') as archive: 97 | archive.extractall('sh_ft') 98 | 99 | print('Converting...') 100 | output = {} 101 | for subject in subjects: 102 | output[subject] = {} 103 | file_list = glob('sh_ft/' + subject + '/StackedHourglassFineTuned240/*.h5') 104 | process_subject(subject, file_list, output) 105 | 106 | print('Saving...') 107 | np.savez_compressed(output_filename_ft, positions_2d=output, metadata=metadata) 108 | 109 | print('Cleaning up...') 110 | rmtree('sh_ft') 111 | 112 | print('Done.') 113 | -------------------------------------------------------------------------------- 
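The 2D preparation scripts above all write the same archive layout: a compressed NumPy file whose 'positions_2d' entry is a nested dictionary (subject -> action -> list of per-camera arrays) and whose 'metadata' entry describes the keypoint layout. A minimal inspection sketch, assuming the fine-tuned Stacked Hourglass archive generated above (any of the generated archives can be substituted):

import numpy as np

# Load one of the generated 2D keypoint archives (the pickled dicts require allow_pickle=True)
archive = np.load('data_2d_h36m_sh_ft_h36m.npz', allow_pickle=True)

metadata = archive['metadata'].item()          # e.g. {'num_joints': 16, 'keypoints_symmetry': [...]}
positions_2d = archive['positions_2d'].item()  # {subject: {action: [cam_0, cam_1, cam_2, cam_3]}}

for subject, actions in positions_2d.items():
    for action, cameras in actions.items():
        for cam_idx, kps in enumerate(cameras):
            if kps is not None:
                print(subject, action, cam_idx, kps.shape)  # (num_frames, num_joints, 2)

This mirrors how run.py later reads these archives (np.load with allow_pickle=True, followed by .item() on both entries).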
/data/prepare_data_h36m.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | import argparse 9 | import os 10 | import zipfile 11 | import numpy as np 12 | import h5py 13 | from glob import glob 14 | from shutil import rmtree 15 | 16 | import sys 17 | sys.path.append('../') 18 | from common.h36m_dataset import Human36mDataset 19 | from common.camera import world_to_camera, project_to_2d, image_coordinates 20 | from common.utils import wrap 21 | 22 | output_filename = 'data_3d_h36m' 23 | output_filename_2d = 'data_2d_h36m_gt' 24 | subjects = ['S1', 'S5', 'S6', 'S7', 'S8', 'S9', 'S11'] 25 | 26 | if __name__ == '__main__': 27 | if os.path.basename(os.getcwd()) != 'data': 28 | print('This script must be launched from the "data" directory') 29 | exit(0) 30 | 31 | parser = argparse.ArgumentParser(description='Human3.6M dataset downloader/converter') 32 | 33 | # Convert dataset preprocessed by Martinez et al. in https://github.com/una-dinosauria/3d-pose-baseline 34 | parser.add_argument('--from-archive', default='', type=str, metavar='PATH', help='convert preprocessed dataset') 35 | 36 | # Convert dataset from original source, using files converted to .mat (the Human3.6M dataset path must be specified manually) 37 | # This option requires MATLAB to convert files using the provided script 38 | parser.add_argument('--from-source', default='', type=str, metavar='PATH', help='convert original dataset') 39 | 40 | # Convert dataset from original source, using original .cdf files (the Human3.6M dataset path must be specified manually) 41 | # This option does not require MATLAB, but the Python library cdflib must be installed 42 | parser.add_argument('--from-source-cdf', default='', type=str, metavar='PATH', help='convert original dataset') 43 | 44 | args = parser.parse_args() 45 | 46 | if args.from_archive and args.from_source: 47 | print('Please specify only one argument') 48 | exit(0) 49 | 50 | if os.path.exists(output_filename + '.npz'): 51 | print('The dataset already exists at', output_filename + '.npz') 52 | exit(0) 53 | 54 | if args.from_archive: 55 | print('Extracting Human3.6M dataset from', args.from_archive) 56 | with zipfile.ZipFile(args.from_archive, 'r') as archive: 57 | archive.extractall() 58 | 59 | print('Converting...') 60 | output = {} 61 | for subject in subjects: 62 | output[subject] = {} 63 | file_list = glob('h36m/' + subject + '/MyPoses/3D_positions/*.h5') 64 | assert len(file_list) == 30, "Expected 30 files for subject " + subject + ", got " + str(len(file_list)) 65 | for f in file_list: 66 | action = os.path.splitext(os.path.basename(f))[0] 67 | 68 | if subject == 'S11' and action == 'Directions': 69 | continue # Discard corrupted video 70 | 71 | with h5py.File(f) as hf: 72 | positions = hf['3D_positions'].value.reshape(32, 3, -1).transpose(2, 0, 1) 73 | positions /= 1000 # Meters instead of millimeters 74 | output[subject][action] = positions.astype('float32') 75 | 76 | print('Saving...') 77 | np.savez_compressed(output_filename, positions_3d=output) 78 | 79 | print('Cleaning up...') 80 | rmtree('h36m') 81 | 82 | print('Done.') 83 | 84 | elif args.from_source: 85 | print('Converting original Human3.6M dataset from', args.from_source) 86 | output = {} 87 | 88 | from scipy.io import loadmat 89 | 90 | for subject in subjects: 91 
| output[subject] = {} 92 | file_list = glob(args.from_source + '/' + subject + '/MyPoseFeatures/D3_Positions/*.cdf.mat') 93 | assert len(file_list) == 30, "Expected 30 files for subject " + subject + ", got " + str(len(file_list)) 94 | for f in file_list: 95 | action = os.path.splitext(os.path.splitext(os.path.basename(f))[0])[0] 96 | 97 | if subject == 'S11' and action == 'Directions': 98 | continue # Discard corrupted video 99 | 100 | # Use consistent naming convention 101 | canonical_name = action.replace('TakingPhoto', 'Photo') \ 102 | .replace('WalkingDog', 'WalkDog') 103 | 104 | hf = loadmat(f) 105 | positions = hf['data'][0, 0].reshape(-1, 32, 3) 106 | positions /= 1000 # Meters instead of millimeters 107 | output[subject][canonical_name] = positions.astype('float32') 108 | 109 | print('Saving...') 110 | np.savez_compressed(output_filename, positions_3d=output) 111 | 112 | print('Done.') 113 | 114 | elif args.from_source_cdf: 115 | print('Converting original Human3.6M dataset from', args.from_source_cdf, '(CDF files)') 116 | output = {} 117 | 118 | import cdflib 119 | 120 | for subject in subjects: 121 | output[subject] = {} 122 | file_list = glob(args.from_source_cdf + '/' + subject + '/MyPoseFeatures/D3_Positions/*.cdf') 123 | assert len(file_list) == 30, "Expected 30 files for subject " + subject + ", got " + str(len(file_list)) 124 | for f in file_list: 125 | action = os.path.splitext(os.path.basename(f))[0] 126 | 127 | if subject == 'S11' and action == 'Directions': 128 | continue # Discard corrupted video 129 | 130 | # Use consistent naming convention 131 | canonical_name = action.replace('TakingPhoto', 'Photo') \ 132 | .replace('WalkingDog', 'WalkDog') 133 | 134 | hf = cdflib.CDF(f) 135 | positions = hf['Pose'].reshape(-1, 32, 3) 136 | positions /= 1000 # Meters instead of millimeters 137 | output[subject][canonical_name] = positions.astype('float32') 138 | 139 | print('Saving...') 140 | np.savez_compressed(output_filename, positions_3d=output) 141 | 142 | print('Done.') 143 | 144 | else: 145 | print('Please specify the dataset source') 146 | exit(0) 147 | 148 | # Create 2D pose file 149 | print('') 150 | print('Computing ground-truth 2D poses...') 151 | dataset = Human36mDataset(output_filename + '.npz') 152 | output_2d_poses = {} 153 | for subject in dataset.subjects(): 154 | output_2d_poses[subject] = {} 155 | for action in dataset[subject].keys(): 156 | anim = dataset[subject][action] 157 | 158 | positions_2d = [] 159 | for cam in anim['cameras']: 160 | pos_3d = world_to_camera(anim['positions'], R=cam['orientation'], t=cam['translation']) 161 | pos_2d = wrap(project_to_2d, pos_3d, cam['intrinsic'], unsqueeze=True) 162 | pos_2d_pixel_space = image_coordinates(pos_2d, w=cam['res_w'], h=cam['res_h']) 163 | positions_2d.append(pos_2d_pixel_space.astype('float32')) 164 | output_2d_poses[subject][action] = positions_2d 165 | 166 | print('Saving...') 167 | metadata = { 168 | 'num_joints': dataset.skeleton().num_joints(), 169 | 'keypoints_symmetry': [dataset.skeleton().joints_left(), dataset.skeleton().joints_right()] 170 | } 171 | np.savez_compressed(output_filename_2d, positions_2d=output_2d_poses, metadata=metadata) 172 | 173 | print('Done.') 174 | -------------------------------------------------------------------------------- /data/prepare_data_humaneva.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 
3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | import argparse 9 | import os 10 | import zipfile 11 | import numpy as np 12 | import h5py 13 | import re 14 | from glob import glob 15 | from shutil import rmtree 16 | from data_utils import suggest_metadata, suggest_pose_importer 17 | 18 | import sys 19 | sys.path.append('../') 20 | from common.utils import wrap 21 | from itertools import groupby 22 | 23 | subjects = ['Train/S1', 'Train/S2', 'Train/S3', 'Validate/S1', 'Validate/S2', 'Validate/S3'] 24 | 25 | cam_map = { 26 | 'C1': 0, 27 | 'C2': 1, 28 | 'C3': 2, 29 | } 30 | 31 | # Frame numbers for train/test split 32 | # format: [start_frame, end_frame[ (inclusive, exclusive) 33 | index = { 34 | 'Train/S1': { 35 | 'Walking 1': (590, 1203), 36 | 'Jog 1': (367, 740), 37 | 'ThrowCatch 1': (473, 945), 38 | 'Gestures 1': (395, 801), 39 | 'Box 1': (385, 789), 40 | }, 41 | 'Train/S2': { 42 | 'Walking 1': (438, 876), 43 | 'Jog 1': (398, 795), 44 | 'ThrowCatch 1': (550, 1128), 45 | 'Gestures 1': (500, 901), 46 | 'Box 1': (382, 734), 47 | }, 48 | 'Train/S3': { 49 | 'Walking 1': (448, 939), 50 | 'Jog 1': (401, 842), 51 | 'ThrowCatch 1': (493, 1027), 52 | 'Gestures 1': (533, 1102), 53 | 'Box 1': (512, 1021), 54 | }, 55 | 'Validate/S1': { 56 | 'Walking 1': (5, 590), 57 | 'Jog 1': (5, 367), 58 | 'ThrowCatch 1': (5, 473), 59 | 'Gestures 1': (5, 395), 60 | 'Box 1': (5, 385), 61 | }, 62 | 'Validate/S2': { 63 | 'Walking 1': (5, 438), 64 | 'Jog 1': (5, 398), 65 | 'ThrowCatch 1': (5, 550), 66 | 'Gestures 1': (5, 500), 67 | 'Box 1': (5, 382), 68 | }, 69 | 'Validate/S3': { 70 | 'Walking 1': (5, 448), 71 | 'Jog 1': (5, 401), 72 | 'ThrowCatch 1': (5, 493), 73 | 'Gestures 1': (5, 533), 74 | 'Box 1': (5, 512), 75 | }, 76 | } 77 | 78 | # Frames to skip for each video (synchronization) 79 | sync_data = { 80 | 'S1': { 81 | 'Walking 1': (82, 81, 82), 82 | 'Jog 1': (51, 51, 50), 83 | 'ThrowCatch 1': (61, 61, 60), 84 | 'Gestures 1': (45, 45, 44), 85 | 'Box 1': (57, 57, 56), 86 | }, 87 | 'S2': { 88 | 'Walking 1': (115, 115, 114), 89 | 'Jog 1': (100, 100, 99), 90 | 'ThrowCatch 1': (127, 127, 127), 91 | 'Gestures 1': (122, 122, 121), 92 | 'Box 1': (119, 119, 117), 93 | }, 94 | 'S3': { 95 | 'Walking 1': (80, 80, 80), 96 | 'Jog 1': (65, 65, 65), 97 | 'ThrowCatch 1': (79, 79, 79), 98 | 'Gestures 1': (83, 83, 82), 99 | 'Box 1': (1, 1, 1), 100 | }, 101 | 'S4': {} 102 | } 103 | 104 | if __name__ == '__main__': 105 | if os.path.basename(os.getcwd()) != 'data': 106 | print('This script must be launched from the "data" directory') 107 | exit(0) 108 | 109 | parser = argparse.ArgumentParser(description='HumanEva dataset converter') 110 | 111 | parser.add_argument('-p', '--path', default='', type=str, metavar='PATH', help='path to the processed HumanEva dataset') 112 | parser.add_argument('--convert-3d', action='store_true', help='convert 3D mocap data') 113 | parser.add_argument('--convert-2d', default='', type=str, metavar='PATH', help='convert user-supplied 2D detections') 114 | parser.add_argument('-o', '--output', default='', type=str, metavar='PATH', help='output suffix for 2D detections (e.g. 
detectron_pt_coco)') 115 | 116 | args = parser.parse_args() 117 | 118 | if not args.convert_2d and not args.convert_3d: 119 | print('Please specify one conversion mode') 120 | exit(0) 121 | 122 | 123 | if args.path: 124 | print('Parsing HumanEva dataset from', args.path) 125 | output = {} 126 | output_2d = {} 127 | frame_mapping = {} 128 | 129 | from scipy.io import loadmat 130 | 131 | num_joints = None 132 | 133 | for subject in subjects: 134 | output[subject] = {} 135 | output_2d[subject] = {} 136 | split, subject_name = subject.split('/') 137 | if subject_name not in frame_mapping: 138 | frame_mapping[subject_name] = {} 139 | 140 | file_list = glob(args.path + '/' + subject + '/*.mat') 141 | for f in file_list: 142 | action = os.path.splitext(os.path.basename(f))[0] 143 | 144 | # Use consistent naming convention 145 | canonical_name = action.replace('_', ' ') 146 | 147 | hf = loadmat(f) 148 | positions = hf['poses_3d'] 149 | positions_2d = hf['poses_2d'].transpose(1, 0, 2, 3) # Ground-truth 2D poses 150 | assert positions.shape[0] == positions_2d.shape[0] and positions.shape[1] == positions_2d.shape[2] 151 | assert num_joints is None or num_joints == positions.shape[1], "Joint number inconsistency among files" 152 | num_joints = positions.shape[1] 153 | 154 | # Sanity check for the sequence length 155 | assert positions.shape[0] == index[subject][canonical_name][1] - index[subject][canonical_name][0] 156 | 157 | # Split corrupted motion capture streams into contiguous chunks 158 | # e.g. 012XX567X9 is split into "012", "567", and "9". 159 | all_chunks = [list(v) for k, v in groupby(positions, lambda x: np.isfinite(x).all())] 160 | all_chunks_2d = [list(v) for k, v in groupby(positions_2d, lambda x: np.isfinite(x).all())] 161 | assert len(all_chunks) == len(all_chunks_2d) 162 | current_index = index[subject][canonical_name][0] 163 | chunk_indices = [] 164 | for i, chunk in enumerate(all_chunks): 165 | next_index = current_index + len(chunk) 166 | name = canonical_name + ' chunk' + str(i) 167 | if np.isfinite(chunk).all(): 168 | output[subject][name] = np.array(chunk, dtype='float32') / 1000 169 | output_2d[subject][name] = list(np.array(all_chunks_2d[i], dtype='float32').transpose(1, 0, 2, 3)) 170 | chunk_indices.append((current_index, next_index, np.isfinite(chunk).all(), split, name)) 171 | current_index = next_index 172 | assert current_index == index[subject][canonical_name][1] 173 | if canonical_name not in frame_mapping[subject_name]: 174 | frame_mapping[subject_name][canonical_name] = [] 175 | frame_mapping[subject_name][canonical_name] += chunk_indices 176 | 177 | metadata = suggest_metadata('humaneva' + str(num_joints)) 178 | output_filename = 'data_3d_' + metadata['layout_name'] 179 | output_prefix_2d = 'data_2d_' + metadata['layout_name'] + '_' 180 | 181 | if args.convert_3d: 182 | print('Saving...') 183 | np.savez_compressed(output_filename, positions_3d=output) 184 | np.savez_compressed(output_prefix_2d + 'gt', positions_2d=output_2d, metadata=metadata) 185 | print('Done.') 186 | 187 | else: 188 | print('Please specify the dataset source') 189 | exit(0) 190 | 191 | if args.convert_2d: 192 | if not args.output: 193 | print('Please specify an output suffix (e.g. 
detectron_pt_coco)') 194 | exit(0) 195 | 196 | import_func = suggest_pose_importer(args.output) 197 | metadata = suggest_metadata(args.output) 198 | 199 | print('Parsing 2D detections from', args.convert_2d) 200 | 201 | output = {} 202 | file_list = glob(args.convert_2d + '/S*/*.avi.npz') 203 | for f in file_list: 204 | path, fname = os.path.split(f) 205 | subject = os.path.basename(path) 206 | assert subject.startswith('S'), subject + ' does not look like a subject directory' 207 | 208 | m = re.search('(.*) \\((.*)\\)', fname.replace('_', ' ')) 209 | action = m.group(1) 210 | camera = m.group(2) 211 | camera_idx = cam_map[camera] 212 | 213 | keypoints = import_func(f) 214 | assert keypoints.shape[1] == metadata['num_joints'] 215 | 216 | if action in sync_data[subject]: 217 | sync_offset = sync_data[subject][action][camera_idx] - 1 218 | else: 219 | sync_offset = 0 220 | 221 | if subject in frame_mapping and action in frame_mapping[subject]: 222 | chunks = frame_mapping[subject][action] 223 | for (start_idx, end_idx, labeled, split, name) in chunks: 224 | canonical_subject = split + '/' + subject 225 | if not labeled: 226 | canonical_subject = 'Unlabeled/' + canonical_subject 227 | if canonical_subject not in output: 228 | output[canonical_subject] = {} 229 | kps = keypoints[start_idx+sync_offset:end_idx+sync_offset] 230 | assert len(kps) == end_idx - start_idx, "Got len {}, expected {}".format(len(kps), end_idx - start_idx) 231 | 232 | if name not in output[canonical_subject]: 233 | output[canonical_subject][name] = [None, None, None] 234 | 235 | output[canonical_subject][name][camera_idx] = kps.astype('float32') 236 | else: 237 | canonical_subject = 'Unlabeled/' + subject 238 | if canonical_subject not in output: 239 | output[canonical_subject] = {} 240 | if action not in output[canonical_subject]: 241 | output[canonical_subject][action] = [None, None, None] 242 | output[canonical_subject][action][camera_idx] = keypoints.astype('float32') 243 | 244 | print('Saving...') 245 | np.savez_compressed(output_prefix_2d + args.output, positions_2d=output, metadata=metadata) 246 | print('Done.') -------------------------------------------------------------------------------- /images/batching.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookresearch/VideoPose3D/1afb1ca0f1237776518469876342fc8669d3f6a9/images/batching.png -------------------------------------------------------------------------------- /images/convolutions_1f_naive.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookresearch/VideoPose3D/1afb1ca0f1237776518469876342fc8669d3f6a9/images/convolutions_1f_naive.png -------------------------------------------------------------------------------- /images/convolutions_1f_optimized.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookresearch/VideoPose3D/1afb1ca0f1237776518469876342fc8669d3f6a9/images/convolutions_1f_optimized.png -------------------------------------------------------------------------------- /images/convolutions_anim.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookresearch/VideoPose3D/1afb1ca0f1237776518469876342fc8669d3f6a9/images/convolutions_anim.gif -------------------------------------------------------------------------------- /images/convolutions_causal.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookresearch/VideoPose3D/1afb1ca0f1237776518469876342fc8669d3f6a9/images/convolutions_causal.png -------------------------------------------------------------------------------- /images/convolutions_normal.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookresearch/VideoPose3D/1afb1ca0f1237776518469876342fc8669d3f6a9/images/convolutions_normal.png -------------------------------------------------------------------------------- /images/demo_h36m.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookresearch/VideoPose3D/1afb1ca0f1237776518469876342fc8669d3f6a9/images/demo_h36m.gif -------------------------------------------------------------------------------- /images/demo_humaneva.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookresearch/VideoPose3D/1afb1ca0f1237776518469876342fc8669d3f6a9/images/demo_humaneva.gif -------------------------------------------------------------------------------- /images/demo_humaneva_unlabeled.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookresearch/VideoPose3D/1afb1ca0f1237776518469876342fc8669d3f6a9/images/demo_humaneva_unlabeled.gif -------------------------------------------------------------------------------- /images/demo_temporal.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookresearch/VideoPose3D/1afb1ca0f1237776518469876342fc8669d3f6a9/images/demo_temporal.gif -------------------------------------------------------------------------------- /images/demo_yt.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/facebookresearch/VideoPose3D/1afb1ca0f1237776518469876342fc8669d3f6a9/images/demo_yt.gif -------------------------------------------------------------------------------- /inference/infer_video.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | """Perform inference on a single video or all videos with a certain extension 9 | (e.g., .mp4) in a folder. 
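This is the original Detectron (Caffe2) variant; infer_video_d2.py below is the Detectron2
port and produces the same per-video .npz output (boxes, segments, keypoints, and a
metadata dict with the video resolution).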
10 | """ 11 | 12 | from infer_simple import * 13 | import subprocess as sp 14 | import numpy as np 15 | 16 | def get_resolution(filename): 17 | command = ['ffprobe', '-v', 'error', '-select_streams', 'v:0', 18 | '-show_entries', 'stream=width,height', '-of', 'csv=p=0', filename] 19 | pipe = sp.Popen(command, stdout=sp.PIPE, bufsize=-1) 20 | for line in pipe.stdout: 21 | w, h = line.decode().strip().split(',') 22 | return int(w), int(h) 23 | 24 | def read_video(filename): 25 | w, h = get_resolution(filename) 26 | 27 | command = ['ffmpeg', 28 | '-i', filename, 29 | '-f', 'image2pipe', 30 | '-pix_fmt', 'bgr24', 31 | '-vsync', '0', 32 | '-vcodec', 'rawvideo', '-'] 33 | 34 | pipe = sp.Popen(command, stdout=sp.PIPE, bufsize=-1) 35 | while True: 36 | data = pipe.stdout.read(w*h*3) 37 | if not data: 38 | break 39 | yield np.frombuffer(data, dtype='uint8').reshape((h, w, 3)) 40 | 41 | 42 | def main(args): 43 | 44 | logger = logging.getLogger(__name__) 45 | merge_cfg_from_file(args.cfg) 46 | cfg.NUM_GPUS = 1 47 | args.weights = cache_url(args.weights, cfg.DOWNLOAD_CACHE) 48 | assert_and_infer_cfg(cache_urls=False) 49 | model = infer_engine.initialize_model_from_cfg(args.weights) 50 | dummy_coco_dataset = dummy_datasets.get_coco_dataset() 51 | 52 | 53 | 54 | if os.path.isdir(args.im_or_folder): 55 | im_list = glob.iglob(args.im_or_folder + '/*.' + args.image_ext) 56 | else: 57 | im_list = [args.im_or_folder] 58 | 59 | for video_name in im_list: 60 | out_name = os.path.join( 61 | args.output_dir, os.path.basename(video_name) 62 | ) 63 | print('Processing {}'.format(video_name)) 64 | 65 | boxes = [] 66 | segments = [] 67 | keypoints = [] 68 | 69 | for frame_i, im in enumerate(read_video(video_name)): 70 | 71 | logger.info('Frame {}'.format(frame_i)) 72 | timers = defaultdict(Timer) 73 | t = time.time() 74 | with c2_utils.NamedCudaScope(0): 75 | cls_boxes, cls_segms, cls_keyps = infer_engine.im_detect_all( 76 | model, im, None, timers=timers 77 | ) 78 | logger.info('Inference time: {:.3f}s'.format(time.time() - t)) 79 | for k, v in timers.items(): 80 | logger.info(' | {}: {:.3f}s'.format(k, v.average_time)) 81 | 82 | boxes.append(cls_boxes) 83 | segments.append(cls_segms) 84 | keypoints.append(cls_keyps) 85 | 86 | 87 | # Video resolution 88 | metadata = { 89 | 'w': im.shape[1], 90 | 'h': im.shape[0], 91 | } 92 | 93 | np.savez_compressed(out_name, boxes=boxes, segments=segments, keypoints=keypoints, metadata=metadata) 94 | 95 | 96 | if __name__ == '__main__': 97 | workspace.GlobalInit(['caffe2', '--caffe2_log_level=0']) 98 | setup_logging(__name__) 99 | args = parse_args() 100 | main(args) 101 | -------------------------------------------------------------------------------- /inference/infer_video_d2.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 6 | # 7 | 8 | """Perform inference on a single video or all videos with a certain extension 9 | (e.g., .mp4) in a folder. 
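Hypothetical invocation (the config name is an assumption -- any COCO keypoint R-CNN
config from the detectron2 model zoo should work, since it is resolved through
model_zoo.get_config_file / get_checkpoint_url below):

    python infer_video_d2.py \
        --cfg COCO-Keypoints/keypoint_rcnn_R_101_FPN_3x.yaml \
        --image-ext mp4 \
        --output-dir /path/to/output \
        /path/to/input_videos

One .npz file per input video is written to the output directory.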
10 | """ 11 | 12 | import detectron2 13 | from detectron2.utils.logger import setup_logger 14 | from detectron2.config import get_cfg 15 | from detectron2 import model_zoo 16 | from detectron2.engine import DefaultPredictor 17 | 18 | import subprocess as sp 19 | import numpy as np 20 | import time 21 | import argparse 22 | import sys 23 | import os 24 | import glob 25 | 26 | def parse_args(): 27 | parser = argparse.ArgumentParser(description='End-to-end inference') 28 | parser.add_argument( 29 | '--cfg', 30 | dest='cfg', 31 | help='cfg model file (/path/to/model_config.yaml)', 32 | default=None, 33 | type=str 34 | ) 35 | parser.add_argument( 36 | '--output-dir', 37 | dest='output_dir', 38 | help='directory for visualization pdfs (default: /tmp/infer_simple)', 39 | default='/tmp/infer_simple', 40 | type=str 41 | ) 42 | parser.add_argument( 43 | '--image-ext', 44 | dest='image_ext', 45 | help='image file name extension (default: mp4)', 46 | default='mp4', 47 | type=str 48 | ) 49 | parser.add_argument( 50 | 'im_or_folder', help='image or folder of images', default=None 51 | ) 52 | if len(sys.argv) == 1: 53 | parser.print_help() 54 | sys.exit(1) 55 | return parser.parse_args() 56 | 57 | def get_resolution(filename): 58 | command = ['ffprobe', '-v', 'error', '-select_streams', 'v:0', 59 | '-show_entries', 'stream=width,height', '-of', 'csv=p=0', filename] 60 | pipe = sp.Popen(command, stdout=sp.PIPE, bufsize=-1) 61 | for line in pipe.stdout: 62 | w, h = line.decode().strip().split(',') 63 | return int(w), int(h) 64 | 65 | def read_video(filename): 66 | w, h = get_resolution(filename) 67 | 68 | command = ['ffmpeg', 69 | '-i', filename, 70 | '-f', 'image2pipe', 71 | '-pix_fmt', 'bgr24', 72 | '-vsync', '0', 73 | '-vcodec', 'rawvideo', '-'] 74 | 75 | pipe = sp.Popen(command, stdout=sp.PIPE, bufsize=-1) 76 | while True: 77 | data = pipe.stdout.read(w*h*3) 78 | if not data: 79 | break 80 | yield np.frombuffer(data, dtype='uint8').reshape((h, w, 3)) 81 | 82 | 83 | def main(args): 84 | 85 | cfg = get_cfg() 86 | cfg.merge_from_file(model_zoo.get_config_file(args.cfg)) 87 | cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7 88 | cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(args.cfg) 89 | predictor = DefaultPredictor(cfg) 90 | 91 | 92 | if os.path.isdir(args.im_or_folder): 93 | im_list = glob.iglob(args.im_or_folder + '/*.' 
+ args.image_ext) 94 | else: 95 | im_list = [args.im_or_folder] 96 | 97 | for video_name in im_list: 98 | out_name = os.path.join( 99 | args.output_dir, os.path.basename(video_name) 100 | ) 101 | print('Processing {}'.format(video_name)) 102 | 103 | boxes = [] 104 | segments = [] 105 | keypoints = [] 106 | 107 | for frame_i, im in enumerate(read_video(video_name)): 108 | t = time.time() 109 | outputs = predictor(im)['instances'].to('cpu') 110 | 111 | print('Frame {} processed in {:.3f}s'.format(frame_i, time.time() - t)) 112 | 113 | has_bbox = False 114 | if outputs.has('pred_boxes'): 115 | bbox_tensor = outputs.pred_boxes.tensor.numpy() 116 | if len(bbox_tensor) > 0: 117 | has_bbox = True 118 | scores = outputs.scores.numpy()[:, None] 119 | bbox_tensor = np.concatenate((bbox_tensor, scores), axis=1) 120 | if has_bbox: 121 | kps = outputs.pred_keypoints.numpy() 122 | kps_xy = kps[:, :, :2] 123 | kps_prob = kps[:, :, 2:3] 124 | kps_logit = np.zeros_like(kps_prob) # Dummy 125 | kps = np.concatenate((kps_xy, kps_logit, kps_prob), axis=2) 126 | kps = kps.transpose(0, 2, 1) 127 | else: 128 | kps = [] 129 | bbox_tensor = [] 130 | 131 | # Mimic Detectron1 format 132 | cls_boxes = [[], bbox_tensor] 133 | cls_keyps = [[], kps] 134 | 135 | boxes.append(cls_boxes) 136 | segments.append(None) 137 | keypoints.append(cls_keyps) 138 | 139 | 140 | # Video resolution 141 | metadata = { 142 | 'w': im.shape[1], 143 | 'h': im.shape[0], 144 | } 145 | 146 | np.savez_compressed(out_name, boxes=boxes, segments=segments, keypoints=keypoints, metadata=metadata) 147 | 148 | 149 | if __name__ == '__main__': 150 | setup_logger() 151 | args = parse_args() 152 | main(args) 153 | -------------------------------------------------------------------------------- /run.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2018-present, Facebook, Inc. 2 | # All rights reserved. 3 | # 4 | # This source code is licensed under the license found in the 5 | # LICENSE file in the root directory of this source tree. 
6 | # 7 | 8 | import numpy as np 9 | 10 | from common.arguments import parse_args 11 | import torch 12 | 13 | import torch.nn as nn 14 | import torch.nn.functional as F 15 | import torch.optim as optim 16 | import os 17 | import sys 18 | import errno 19 | 20 | from common.camera import * 21 | from common.model import * 22 | from common.loss import * 23 | from common.generators import ChunkedGenerator, UnchunkedGenerator 24 | from time import time 25 | from common.utils import deterministic_random 26 | 27 | args = parse_args() 28 | print(args) 29 | 30 | try: 31 | # Create checkpoint directory if it does not exist 32 | os.makedirs(args.checkpoint) 33 | except OSError as e: 34 | if e.errno != errno.EEXIST: 35 | raise RuntimeError('Unable to create checkpoint directory:', args.checkpoint) 36 | 37 | print('Loading dataset...') 38 | dataset_path = 'data/data_3d_' + args.dataset + '.npz' 39 | if args.dataset == 'h36m': 40 | from common.h36m_dataset import Human36mDataset 41 | dataset = Human36mDataset(dataset_path) 42 | elif args.dataset.startswith('humaneva'): 43 | from common.humaneva_dataset import HumanEvaDataset 44 | dataset = HumanEvaDataset(dataset_path) 45 | elif args.dataset.startswith('custom'): 46 | from common.custom_dataset import CustomDataset 47 | dataset = CustomDataset('data/data_2d_' + args.dataset + '_' + args.keypoints + '.npz') 48 | else: 49 | raise KeyError('Invalid dataset') 50 | 51 | print('Preparing data...') 52 | for subject in dataset.subjects(): 53 | for action in dataset[subject].keys(): 54 | anim = dataset[subject][action] 55 | 56 | if 'positions' in anim: 57 | positions_3d = [] 58 | for cam in anim['cameras']: 59 | pos_3d = world_to_camera(anim['positions'], R=cam['orientation'], t=cam['translation']) 60 | pos_3d[:, 1:] -= pos_3d[:, :1] # Remove global offset, but keep trajectory in first position 61 | positions_3d.append(pos_3d) 62 | anim['positions_3d'] = positions_3d 63 | 64 | print('Loading 2D detections...') 65 | keypoints = np.load('data/data_2d_' + args.dataset + '_' + args.keypoints + '.npz', allow_pickle=True) 66 | keypoints_metadata = keypoints['metadata'].item() 67 | keypoints_symmetry = keypoints_metadata['keypoints_symmetry'] 68 | kps_left, kps_right = list(keypoints_symmetry[0]), list(keypoints_symmetry[1]) 69 | joints_left, joints_right = list(dataset.skeleton().joints_left()), list(dataset.skeleton().joints_right()) 70 | keypoints = keypoints['positions_2d'].item() 71 | 72 | for subject in dataset.subjects(): 73 | assert subject in keypoints, 'Subject {} is missing from the 2D detections dataset'.format(subject) 74 | for action in dataset[subject].keys(): 75 | assert action in keypoints[subject], 'Action {} of subject {} is missing from the 2D detections dataset'.format(action, subject) 76 | if 'positions_3d' not in dataset[subject][action]: 77 | continue 78 | 79 | for cam_idx in range(len(keypoints[subject][action])): 80 | 81 | # We check for >= instead of == because some videos in H3.6M contain extra frames 82 | mocap_length = dataset[subject][action]['positions_3d'][cam_idx].shape[0] 83 | assert keypoints[subject][action][cam_idx].shape[0] >= mocap_length 84 | 85 | if keypoints[subject][action][cam_idx].shape[0] > mocap_length: 86 | # Shorten sequence 87 | keypoints[subject][action][cam_idx] = keypoints[subject][action][cam_idx][:mocap_length] 88 | 89 | assert len(keypoints[subject][action]) == len(dataset[subject][action]['positions_3d']) 90 | 91 | for subject in keypoints.keys(): 92 | for action in keypoints[subject]: 93 | for cam_idx, kps in 
enumerate(keypoints[subject][action]): 94 | # Normalize camera frame 95 | cam = dataset.cameras()[subject][cam_idx] 96 | kps[..., :2] = normalize_screen_coordinates(kps[..., :2], w=cam['res_w'], h=cam['res_h']) 97 | keypoints[subject][action][cam_idx] = kps 98 | 99 | subjects_train = args.subjects_train.split(',') 100 | subjects_semi = [] if not args.subjects_unlabeled else args.subjects_unlabeled.split(',') 101 | if not args.render: 102 | subjects_test = args.subjects_test.split(',') 103 | else: 104 | subjects_test = [args.viz_subject] 105 | 106 | semi_supervised = len(subjects_semi) > 0 107 | if semi_supervised and not dataset.supports_semi_supervised(): 108 | raise RuntimeError('Semi-supervised training is not implemented for this dataset') 109 | 110 | def fetch(subjects, action_filter=None, subset=1, parse_3d_poses=True): 111 | out_poses_3d = [] 112 | out_poses_2d = [] 113 | out_camera_params = [] 114 | for subject in subjects: 115 | for action in keypoints[subject].keys(): 116 | if action_filter is not None: 117 | found = False 118 | for a in action_filter: 119 | if action.startswith(a): 120 | found = True 121 | break 122 | if not found: 123 | continue 124 | 125 | poses_2d = keypoints[subject][action] 126 | for i in range(len(poses_2d)): # Iterate across cameras 127 | out_poses_2d.append(poses_2d[i]) 128 | 129 | if subject in dataset.cameras(): 130 | cams = dataset.cameras()[subject] 131 | assert len(cams) == len(poses_2d), 'Camera count mismatch' 132 | for cam in cams: 133 | if 'intrinsic' in cam: 134 | out_camera_params.append(cam['intrinsic']) 135 | 136 | if parse_3d_poses and 'positions_3d' in dataset[subject][action]: 137 | poses_3d = dataset[subject][action]['positions_3d'] 138 | assert len(poses_3d) == len(poses_2d), 'Camera count mismatch' 139 | for i in range(len(poses_3d)): # Iterate across cameras 140 | out_poses_3d.append(poses_3d[i]) 141 | 142 | if len(out_camera_params) == 0: 143 | out_camera_params = None 144 | if len(out_poses_3d) == 0: 145 | out_poses_3d = None 146 | 147 | stride = args.downsample 148 | if subset < 1: 149 | for i in range(len(out_poses_2d)): 150 | n_frames = int(round(len(out_poses_2d[i])//stride * subset)*stride) 151 | start = deterministic_random(0, len(out_poses_2d[i]) - n_frames + 1, str(len(out_poses_2d[i]))) 152 | out_poses_2d[i] = out_poses_2d[i][start:start+n_frames:stride] 153 | if out_poses_3d is not None: 154 | out_poses_3d[i] = out_poses_3d[i][start:start+n_frames:stride] 155 | elif stride > 1: 156 | # Downsample as requested 157 | for i in range(len(out_poses_2d)): 158 | out_poses_2d[i] = out_poses_2d[i][::stride] 159 | if out_poses_3d is not None: 160 | out_poses_3d[i] = out_poses_3d[i][::stride] 161 | 162 | 163 | return out_camera_params, out_poses_3d, out_poses_2d 164 | 165 | action_filter = None if args.actions == '*' else args.actions.split(',') 166 | if action_filter is not None: 167 | print('Selected actions:', action_filter) 168 | 169 | cameras_valid, poses_valid, poses_valid_2d = fetch(subjects_test, action_filter) 170 | 171 | filter_widths = [int(x) for x in args.architecture.split(',')] 172 | if not args.disable_optimizations and not args.dense and args.stride == 1: 173 | # Use optimized model for single-frame predictions 174 | model_pos_train = TemporalModelOptimized1f(poses_valid_2d[0].shape[-2], poses_valid_2d[0].shape[-1], dataset.skeleton().num_joints(), 175 | filter_widths=filter_widths, causal=args.causal, dropout=args.dropout, channels=args.channels) 176 | else: 177 | # When incompatible settings are detected (stride > 
1, dense filters, or disabled optimization) fall back to normal model 178 | model_pos_train = TemporalModel(poses_valid_2d[0].shape[-2], poses_valid_2d[0].shape[-1], dataset.skeleton().num_joints(), 179 | filter_widths=filter_widths, causal=args.causal, dropout=args.dropout, channels=args.channels, 180 | dense=args.dense) 181 | 182 | model_pos = TemporalModel(poses_valid_2d[0].shape[-2], poses_valid_2d[0].shape[-1], dataset.skeleton().num_joints(), 183 | filter_widths=filter_widths, causal=args.causal, dropout=args.dropout, channels=args.channels, 184 | dense=args.dense) 185 | 186 | receptive_field = model_pos.receptive_field() 187 | print('INFO: Receptive field: {} frames'.format(receptive_field)) 188 | pad = (receptive_field - 1) // 2 # Padding on each side 189 | if args.causal: 190 | print('INFO: Using causal convolutions') 191 | causal_shift = pad 192 | else: 193 | causal_shift = 0 194 | 195 | model_params = 0 196 | for parameter in model_pos.parameters(): 197 | model_params += parameter.numel() 198 | print('INFO: Trainable parameter count:', model_params) 199 | 200 | if torch.cuda.is_available(): 201 | model_pos = model_pos.cuda() 202 | model_pos_train = model_pos_train.cuda() 203 | 204 | if args.resume or args.evaluate: 205 | chk_filename = os.path.join(args.checkpoint, args.resume if args.resume else args.evaluate) 206 | print('Loading checkpoint', chk_filename) 207 | checkpoint = torch.load(chk_filename, map_location=lambda storage, loc: storage) 208 | print('This model was trained for {} epochs'.format(checkpoint['epoch'])) 209 | model_pos_train.load_state_dict(checkpoint['model_pos']) 210 | model_pos.load_state_dict(checkpoint['model_pos']) 211 | 212 | if args.evaluate and 'model_traj' in checkpoint: 213 | # Load trajectory model if it is contained in the checkpoint (e.g.
for inference in the wild) 214 | model_traj = TemporalModel(poses_valid_2d[0].shape[-2], poses_valid_2d[0].shape[-1], 1, 215 | filter_widths=filter_widths, causal=args.causal, dropout=args.dropout, channels=args.channels, 216 | dense=args.dense) 217 | if torch.cuda.is_available(): 218 | model_traj = model_traj.cuda() 219 | model_traj.load_state_dict(checkpoint['model_traj']) 220 | else: 221 | model_traj = None 222 | 223 | 224 | test_generator = UnchunkedGenerator(cameras_valid, poses_valid, poses_valid_2d, 225 | pad=pad, causal_shift=causal_shift, augment=False, 226 | kps_left=kps_left, kps_right=kps_right, joints_left=joints_left, joints_right=joints_right) 227 | print('INFO: Testing on {} frames'.format(test_generator.num_frames())) 228 | 229 | if not args.evaluate: 230 | cameras_train, poses_train, poses_train_2d = fetch(subjects_train, action_filter, subset=args.subset) 231 | 232 | lr = args.learning_rate 233 | if semi_supervised: 234 | cameras_semi, _, poses_semi_2d = fetch(subjects_semi, action_filter, parse_3d_poses=False) 235 | 236 | if not args.disable_optimizations and not args.dense and args.stride == 1: 237 | # Use optimized model for single-frame predictions 238 | model_traj_train = TemporalModelOptimized1f(poses_valid_2d[0].shape[-2], poses_valid_2d[0].shape[-1], 1, 239 | filter_widths=filter_widths, causal=args.causal, dropout=args.dropout, channels=args.channels) 240 | else: 241 | # When incompatible settings are detected (stride > 1, dense filters, or disabled optimization) fall back to normal model 242 | model_traj_train = TemporalModel(poses_valid_2d[0].shape[-2], poses_valid_2d[0].shape[-1], 1, 243 | filter_widths=filter_widths, causal=args.causal, dropout=args.dropout, channels=args.channels, 244 | dense=args.dense) 245 | 246 | model_traj = TemporalModel(poses_valid_2d[0].shape[-2], poses_valid_2d[0].shape[-1], 1, 247 | filter_widths=filter_widths, causal=args.causal, dropout=args.dropout, channels=args.channels, 248 | dense=args.dense) 249 | if torch.cuda.is_available(): 250 | model_traj = model_traj.cuda() 251 | model_traj_train = model_traj_train.cuda() 252 | optimizer = optim.Adam(list(model_pos_train.parameters()) + list(model_traj_train.parameters()), 253 | lr=lr, amsgrad=True) 254 | 255 | losses_2d_train_unlabeled = [] 256 | losses_2d_train_labeled_eval = [] 257 | losses_2d_train_unlabeled_eval = [] 258 | losses_2d_valid = [] 259 | 260 | losses_traj_train = [] 261 | losses_traj_train_eval = [] 262 | losses_traj_valid = [] 263 | else: 264 | optimizer = optim.Adam(model_pos_train.parameters(), lr=lr, amsgrad=True) 265 | 266 | lr_decay = args.lr_decay 267 | 268 | losses_3d_train = [] 269 | losses_3d_train_eval = [] 270 | losses_3d_valid = [] 271 | 272 | epoch = 0 273 | initial_momentum = 0.1 274 | final_momentum = 0.001 275 | 276 | 277 | train_generator = ChunkedGenerator(args.batch_size//args.stride, cameras_train, poses_train, poses_train_2d, args.stride, 278 | pad=pad, causal_shift=causal_shift, shuffle=True, augment=args.data_augmentation, 279 | kps_left=kps_left, kps_right=kps_right, joints_left=joints_left, joints_right=joints_right) 280 | train_generator_eval = UnchunkedGenerator(cameras_train, poses_train, poses_train_2d, 281 | pad=pad, causal_shift=causal_shift, augment=False) 282 | print('INFO: Training on {} frames'.format(train_generator_eval.num_frames())) 283 | if semi_supervised: 284 | semi_generator = ChunkedGenerator(args.batch_size//args.stride, cameras_semi, None, poses_semi_2d, args.stride, 285 | pad=pad, causal_shift=causal_shift, shuffle=True, 
286 | random_seed=4321, augment=args.data_augmentation, 287 | kps_left=kps_left, kps_right=kps_right, joints_left=joints_left, joints_right=joints_right, 288 | endless=True) 289 | semi_generator_eval = UnchunkedGenerator(cameras_semi, None, poses_semi_2d, 290 | pad=pad, causal_shift=causal_shift, augment=False) 291 | print('INFO: Semi-supervision on {} frames'.format(semi_generator_eval.num_frames())) 292 | 293 | if args.resume: 294 | epoch = checkpoint['epoch'] 295 | if 'optimizer' in checkpoint and checkpoint['optimizer'] is not None: 296 | optimizer.load_state_dict(checkpoint['optimizer']) 297 | train_generator.set_random_state(checkpoint['random_state']) 298 | else: 299 | print('WARNING: this checkpoint does not contain an optimizer state. The optimizer will be reinitialized.') 300 | 301 | lr = checkpoint['lr'] 302 | if semi_supervised: 303 | model_traj_train.load_state_dict(checkpoint['model_traj']) 304 | model_traj.load_state_dict(checkpoint['model_traj']) 305 | semi_generator.set_random_state(checkpoint['random_state_semi']) 306 | 307 | print('** Note: reported losses are averaged over all frames and test-time augmentation is not used here.') 308 | print('** The final evaluation will be carried out after the last training epoch.') 309 | 310 | # Pos model only 311 | while epoch < args.epochs: 312 | start_time = time() 313 | epoch_loss_3d_train = 0 314 | epoch_loss_traj_train = 0 315 | epoch_loss_2d_train_unlabeled = 0 316 | N = 0 317 | N_semi = 0 318 | model_pos_train.train() 319 | if semi_supervised: 320 | # Semi-supervised scenario 321 | model_traj_train.train() 322 | for (_, batch_3d, batch_2d), (cam_semi, _, batch_2d_semi) in \ 323 | zip(train_generator.next_epoch(), semi_generator.next_epoch()): 324 | 325 | # Fall back to supervised training for the first epoch (to avoid instability) 326 | skip = epoch < args.warmup 327 | 328 | cam_semi = torch.from_numpy(cam_semi.astype('float32')) 329 | inputs_3d = torch.from_numpy(batch_3d.astype('float32')) 330 | if torch.cuda.is_available(): 331 | cam_semi = cam_semi.cuda() 332 | inputs_3d = inputs_3d.cuda() 333 | 334 | inputs_traj = inputs_3d[:, :, :1].clone() 335 | inputs_3d[:, :, 0] = 0 336 | 337 | # Split point between labeled and unlabeled samples in the batch 338 | split_idx = inputs_3d.shape[0] 339 | 340 | inputs_2d = torch.from_numpy(batch_2d.astype('float32')) 341 | inputs_2d_semi = torch.from_numpy(batch_2d_semi.astype('float32')) 342 | if torch.cuda.is_available(): 343 | inputs_2d = inputs_2d.cuda() 344 | inputs_2d_semi = inputs_2d_semi.cuda() 345 | inputs_2d_cat = torch.cat((inputs_2d, inputs_2d_semi), dim=0) if not skip else inputs_2d 346 | 347 | optimizer.zero_grad() 348 | 349 | # Compute 3D poses 350 | predicted_3d_pos_cat = model_pos_train(inputs_2d_cat) 351 | 352 | loss_3d_pos = mpjpe(predicted_3d_pos_cat[:split_idx], inputs_3d) 353 | epoch_loss_3d_train += inputs_3d.shape[0]*inputs_3d.shape[1] * loss_3d_pos.item() 354 | N += inputs_3d.shape[0]*inputs_3d.shape[1] 355 | loss_total = loss_3d_pos 356 | 357 | # Compute global trajectory 358 | predicted_traj_cat = model_traj_train(inputs_2d_cat) 359 | w = 1 / inputs_traj[:, :, :, 2] # Weight inversely proportional to depth 360 | loss_traj = weighted_mpjpe(predicted_traj_cat[:split_idx], inputs_traj, w) 361 | epoch_loss_traj_train += inputs_3d.shape[0]*inputs_3d.shape[1] * loss_traj.item() 362 | assert inputs_traj.shape[0]*inputs_traj.shape[1] == inputs_3d.shape[0]*inputs_3d.shape[1] 363 | loss_total += loss_traj 364 | 365 | if not skip: 366 | # Semi-supervised loss for unlabeled 
samples 367 | predicted_semi = predicted_3d_pos_cat[split_idx:] 368 | if pad > 0: 369 | target_semi = inputs_2d_semi[:, pad:-pad, :, :2].contiguous() 370 | else: 371 | target_semi = inputs_2d_semi[:, :, :, :2].contiguous() 372 | 373 | projection_func = project_to_2d_linear if args.linear_projection else project_to_2d 374 | reconstruction_semi = projection_func(predicted_semi + predicted_traj_cat[split_idx:], cam_semi) 375 | 376 | loss_reconstruction = mpjpe(reconstruction_semi, target_semi) # On 2D poses 377 | epoch_loss_2d_train_unlabeled += predicted_semi.shape[0]*predicted_semi.shape[1] * loss_reconstruction.item() 378 | if not args.no_proj: 379 | loss_total += loss_reconstruction 380 | 381 | # Bone length term to enforce kinematic constraints 382 | if args.bone_length_term: 383 | dists = predicted_3d_pos_cat[:, :, 1:] - predicted_3d_pos_cat[:, :, dataset.skeleton().parents()[1:]] 384 | bone_lengths = torch.mean(torch.norm(dists, dim=3), dim=1) 385 | penalty = torch.mean(torch.abs(torch.mean(bone_lengths[:split_idx], dim=0) \ 386 | - torch.mean(bone_lengths[split_idx:], dim=0))) 387 | loss_total += penalty 388 | 389 | 390 | N_semi += predicted_semi.shape[0]*predicted_semi.shape[1] 391 | else: 392 | N_semi += 1 # To avoid division by zero 393 | 394 | loss_total.backward() 395 | 396 | optimizer.step() 397 | losses_traj_train.append(epoch_loss_traj_train / N) 398 | losses_2d_train_unlabeled.append(epoch_loss_2d_train_unlabeled / N_semi) 399 | else: 400 | # Regular supervised scenario 401 | for _, batch_3d, batch_2d in train_generator.next_epoch(): 402 | inputs_3d = torch.from_numpy(batch_3d.astype('float32')) 403 | inputs_2d = torch.from_numpy(batch_2d.astype('float32')) 404 | if torch.cuda.is_available(): 405 | inputs_3d = inputs_3d.cuda() 406 | inputs_2d = inputs_2d.cuda() 407 | inputs_3d[:, :, 0] = 0 408 | 409 | optimizer.zero_grad() 410 | 411 | # Predict 3D poses 412 | predicted_3d_pos = model_pos_train(inputs_2d) 413 | loss_3d_pos = mpjpe(predicted_3d_pos, inputs_3d) 414 | epoch_loss_3d_train += inputs_3d.shape[0]*inputs_3d.shape[1] * loss_3d_pos.item() 415 | N += inputs_3d.shape[0]*inputs_3d.shape[1] 416 | 417 | loss_total = loss_3d_pos 418 | loss_total.backward() 419 | 420 | optimizer.step() 421 | 422 | losses_3d_train.append(epoch_loss_3d_train / N) 423 | 424 | # End-of-epoch evaluation 425 | with torch.no_grad(): 426 | model_pos.load_state_dict(model_pos_train.state_dict()) 427 | model_pos.eval() 428 | if semi_supervised: 429 | model_traj.load_state_dict(model_traj_train.state_dict()) 430 | model_traj.eval() 431 | 432 | epoch_loss_3d_valid = 0 433 | epoch_loss_traj_valid = 0 434 | epoch_loss_2d_valid = 0 435 | N = 0 436 | 437 | if not args.no_eval: 438 | # Evaluate on test set 439 | for cam, batch, batch_2d in test_generator.next_epoch(): 440 | inputs_3d = torch.from_numpy(batch.astype('float32')) 441 | inputs_2d = torch.from_numpy(batch_2d.astype('float32')) 442 | if torch.cuda.is_available(): 443 | inputs_3d = inputs_3d.cuda() 444 | inputs_2d = inputs_2d.cuda() 445 | inputs_traj = inputs_3d[:, :, :1].clone() 446 | inputs_3d[:, :, 0] = 0 447 | 448 | # Predict 3D poses 449 | predicted_3d_pos = model_pos(inputs_2d) 450 | loss_3d_pos = mpjpe(predicted_3d_pos, inputs_3d) 451 | epoch_loss_3d_valid += inputs_3d.shape[0]*inputs_3d.shape[1] * loss_3d_pos.item() 452 | N += inputs_3d.shape[0]*inputs_3d.shape[1] 453 | 454 | if semi_supervised: 455 | cam = torch.from_numpy(cam.astype('float32')) 456 | if torch.cuda.is_available(): 457 | cam = cam.cuda() 458 | 459 | predicted_traj = 
model_traj(inputs_2d) 460 | loss_traj = mpjpe(predicted_traj, inputs_traj) 461 | epoch_loss_traj_valid += inputs_traj.shape[0]*inputs_traj.shape[1] * loss_traj.item() 462 | assert inputs_traj.shape[0]*inputs_traj.shape[1] == inputs_3d.shape[0]*inputs_3d.shape[1] 463 | 464 | if pad > 0: 465 | target = inputs_2d[:, pad:-pad, :, :2].contiguous() 466 | else: 467 | target = inputs_2d[:, :, :, :2].contiguous() 468 | reconstruction = project_to_2d(predicted_3d_pos + predicted_traj, cam) 469 | loss_reconstruction = mpjpe(reconstruction, target) # On 2D poses 470 | epoch_loss_2d_valid += reconstruction.shape[0]*reconstruction.shape[1] * loss_reconstruction.item() 471 | assert reconstruction.shape[0]*reconstruction.shape[1] == inputs_3d.shape[0]*inputs_3d.shape[1] 472 | 473 | losses_3d_valid.append(epoch_loss_3d_valid / N) 474 | if semi_supervised: 475 | losses_traj_valid.append(epoch_loss_traj_valid / N) 476 | losses_2d_valid.append(epoch_loss_2d_valid / N) 477 | 478 | 479 | # Evaluate on training set, this time in evaluation mode 480 | epoch_loss_3d_train_eval = 0 481 | epoch_loss_traj_train_eval = 0 482 | epoch_loss_2d_train_labeled_eval = 0 483 | N = 0 484 | for cam, batch, batch_2d in train_generator_eval.next_epoch(): 485 | if batch_2d.shape[1] == 0: 486 | # This can only happen when downsampling the dataset 487 | continue 488 | 489 | inputs_3d = torch.from_numpy(batch.astype('float32')) 490 | inputs_2d = torch.from_numpy(batch_2d.astype('float32')) 491 | if torch.cuda.is_available(): 492 | inputs_3d = inputs_3d.cuda() 493 | inputs_2d = inputs_2d.cuda() 494 | inputs_traj = inputs_3d[:, :, :1].clone() 495 | inputs_3d[:, :, 0] = 0 496 | 497 | # Compute 3D poses 498 | predicted_3d_pos = model_pos(inputs_2d) 499 | loss_3d_pos = mpjpe(predicted_3d_pos, inputs_3d) 500 | epoch_loss_3d_train_eval += inputs_3d.shape[0]*inputs_3d.shape[1] * loss_3d_pos.item() 501 | N += inputs_3d.shape[0]*inputs_3d.shape[1] 502 | 503 | if semi_supervised: 504 | cam = torch.from_numpy(cam.astype('float32')) 505 | if torch.cuda.is_available(): 506 | cam = cam.cuda() 507 | predicted_traj = model_traj(inputs_2d) 508 | loss_traj = mpjpe(predicted_traj, inputs_traj) 509 | epoch_loss_traj_train_eval += inputs_traj.shape[0]*inputs_traj.shape[1] * loss_traj.item() 510 | assert inputs_traj.shape[0]*inputs_traj.shape[1] == inputs_3d.shape[0]*inputs_3d.shape[1] 511 | 512 | if pad > 0: 513 | target = inputs_2d[:, pad:-pad, :, :2].contiguous() 514 | else: 515 | target = inputs_2d[:, :, :, :2].contiguous() 516 | reconstruction = project_to_2d(predicted_3d_pos + predicted_traj, cam) 517 | loss_reconstruction = mpjpe(reconstruction, target) 518 | epoch_loss_2d_train_labeled_eval += reconstruction.shape[0]*reconstruction.shape[1] * loss_reconstruction.item() 519 | assert reconstruction.shape[0]*reconstruction.shape[1] == inputs_3d.shape[0]*inputs_3d.shape[1] 520 | 521 | losses_3d_train_eval.append(epoch_loss_3d_train_eval / N) 522 | if semi_supervised: 523 | losses_traj_train_eval.append(epoch_loss_traj_train_eval / N) 524 | losses_2d_train_labeled_eval.append(epoch_loss_2d_train_labeled_eval / N) 525 | 526 | # Evaluate 2D loss on unlabeled training set (in evaluation mode) 527 | epoch_loss_2d_train_unlabeled_eval = 0 528 | N_semi = 0 529 | if semi_supervised: 530 | for cam, _, batch_2d in semi_generator_eval.next_epoch(): 531 | cam = torch.from_numpy(cam.astype('float32')) 532 | inputs_2d_semi = torch.from_numpy(batch_2d.astype('float32')) 533 | if torch.cuda.is_available(): 534 | cam = cam.cuda() 535 | inputs_2d_semi = 
inputs_2d_semi.cuda() 536 | 537 | predicted_3d_pos_semi = model_pos(inputs_2d_semi) 538 | predicted_traj_semi = model_traj(inputs_2d_semi) 539 | if pad > 0: 540 | target_semi = inputs_2d_semi[:, pad:-pad, :, :2].contiguous() 541 | else: 542 | target_semi = inputs_2d_semi[:, :, :, :2].contiguous() 543 | reconstruction_semi = project_to_2d(predicted_3d_pos_semi + predicted_traj_semi, cam) 544 | loss_reconstruction_semi = mpjpe(reconstruction_semi, target_semi) 545 | 546 | epoch_loss_2d_train_unlabeled_eval += reconstruction_semi.shape[0]*reconstruction_semi.shape[1] \ 547 | * loss_reconstruction_semi.item() 548 | N_semi += reconstruction_semi.shape[0]*reconstruction_semi.shape[1] 549 | losses_2d_train_unlabeled_eval.append(epoch_loss_2d_train_unlabeled_eval / N_semi) 550 | 551 | elapsed = (time() - start_time)/60 552 | 553 | if args.no_eval: 554 | print('[%d] time %.2f lr %f 3d_train %f' % ( 555 | epoch + 1, 556 | elapsed, 557 | lr, 558 | losses_3d_train[-1] * 1000)) 559 | else: 560 | if semi_supervised: 561 | print('[%d] time %.2f lr %f 3d_train %f 3d_eval %f traj_eval %f 3d_valid %f ' 562 | 'traj_valid %f 2d_train_sup %f 2d_train_unsup %f 2d_valid %f' % ( 563 | epoch + 1, 564 | elapsed, 565 | lr, 566 | losses_3d_train[-1] * 1000, 567 | losses_3d_train_eval[-1] * 1000, 568 | losses_traj_train_eval[-1] * 1000, 569 | losses_3d_valid[-1] * 1000, 570 | losses_traj_valid[-1] * 1000, 571 | losses_2d_train_labeled_eval[-1], 572 | losses_2d_train_unlabeled_eval[-1], 573 | losses_2d_valid[-1])) 574 | else: 575 | print('[%d] time %.2f lr %f 3d_train %f 3d_eval %f 3d_valid %f' % ( 576 | epoch + 1, 577 | elapsed, 578 | lr, 579 | losses_3d_train[-1] * 1000, 580 | losses_3d_train_eval[-1] * 1000, 581 | losses_3d_valid[-1] *1000)) 582 | 583 | # Decay learning rate exponentially 584 | lr *= lr_decay 585 | for param_group in optimizer.param_groups: 586 | param_group['lr'] *= lr_decay 587 | epoch += 1 588 | 589 | # Decay BatchNorm momentum 590 | momentum = initial_momentum * np.exp(-epoch/args.epochs * np.log(initial_momentum/final_momentum)) 591 | model_pos_train.set_bn_momentum(momentum) 592 | if semi_supervised: 593 | model_traj_train.set_bn_momentum(momentum) 594 | 595 | # Save checkpoint if necessary 596 | if epoch % args.checkpoint_frequency == 0: 597 | chk_path = os.path.join(args.checkpoint, 'epoch_{}.bin'.format(epoch)) 598 | print('Saving checkpoint to', chk_path) 599 | 600 | torch.save({ 601 | 'epoch': epoch, 602 | 'lr': lr, 603 | 'random_state': train_generator.random_state(), 604 | 'optimizer': optimizer.state_dict(), 605 | 'model_pos': model_pos_train.state_dict(), 606 | 'model_traj': model_traj_train.state_dict() if semi_supervised else None, 607 | 'random_state_semi': semi_generator.random_state() if semi_supervised else None, 608 | }, chk_path) 609 | 610 | # Save training curves after every epoch, as .png images (if requested) 611 | if args.export_training_curves and epoch > 3: 612 | if 'matplotlib' not in sys.modules: 613 | import matplotlib 614 | matplotlib.use('Agg') 615 | import matplotlib.pyplot as plt 616 | 617 | plt.figure() 618 | epoch_x = np.arange(3, len(losses_3d_train)) + 1 619 | plt.plot(epoch_x, losses_3d_train[3:], '--', color='C0') 620 | plt.plot(epoch_x, losses_3d_train_eval[3:], color='C0') 621 | plt.plot(epoch_x, losses_3d_valid[3:], color='C1') 622 | plt.legend(['3d train', '3d train (eval)', '3d valid (eval)']) 623 | plt.ylabel('MPJPE (m)') 624 | plt.xlabel('Epoch') 625 | plt.xlim((3, epoch)) 626 | plt.savefig(os.path.join(args.checkpoint, 'loss_3d.png')) 627 | 628 | if 
semi_supervised: 629 | plt.figure() 630 | plt.plot(epoch_x, losses_traj_train[3:], '--', color='C0') 631 | plt.plot(epoch_x, losses_traj_train_eval[3:], color='C0') 632 | plt.plot(epoch_x, losses_traj_valid[3:], color='C1') 633 | plt.legend(['traj. train', 'traj. train (eval)', 'traj. valid (eval)']) 634 | plt.ylabel('Mean distance (m)') 635 | plt.xlabel('Epoch') 636 | plt.xlim((3, epoch)) 637 | plt.savefig(os.path.join(args.checkpoint, 'loss_traj.png')) 638 | 639 | plt.figure() 640 | plt.plot(epoch_x, losses_2d_train_labeled_eval[3:], color='C0') 641 | plt.plot(epoch_x, losses_2d_train_unlabeled[3:], '--', color='C1') 642 | plt.plot(epoch_x, losses_2d_train_unlabeled_eval[3:], color='C1') 643 | plt.plot(epoch_x, losses_2d_valid[3:], color='C2') 644 | plt.legend(['2d train labeled (eval)', '2d train unlabeled', '2d train unlabeled (eval)', '2d valid (eval)']) 645 | plt.ylabel('MPJPE (2D)') 646 | plt.xlabel('Epoch') 647 | plt.xlim((3, epoch)) 648 | plt.savefig(os.path.join(args.checkpoint, 'loss_2d.png')) 649 | plt.close('all') 650 | 651 | # Evaluate 652 | def evaluate(test_generator, action=None, return_predictions=False, use_trajectory_model=False): 653 | epoch_loss_3d_pos = 0 654 | epoch_loss_3d_pos_procrustes = 0 655 | epoch_loss_3d_pos_scale = 0 656 | epoch_loss_3d_vel = 0 657 | with torch.no_grad(): 658 | if not use_trajectory_model: 659 | model_pos.eval() 660 | else: 661 | model_traj.eval() 662 | N = 0 663 | for _, batch, batch_2d in test_generator.next_epoch(): 664 | inputs_2d = torch.from_numpy(batch_2d.astype('float32')) 665 | if torch.cuda.is_available(): 666 | inputs_2d = inputs_2d.cuda() 667 | 668 | # Positional model 669 | if not use_trajectory_model: 670 | predicted_3d_pos = model_pos(inputs_2d) 671 | else: 672 | predicted_3d_pos = model_traj(inputs_2d) 673 | 674 | # Test-time augmentation (if enabled) 675 | if test_generator.augment_enabled(): 676 | # Undo flipping and take average with non-flipped version 677 | predicted_3d_pos[1, :, :, 0] *= -1 678 | if not use_trajectory_model: 679 | predicted_3d_pos[1, :, joints_left + joints_right] = predicted_3d_pos[1, :, joints_right + joints_left] 680 | predicted_3d_pos = torch.mean(predicted_3d_pos, dim=0, keepdim=True) 681 | 682 | if return_predictions: 683 | return predicted_3d_pos.squeeze(0).cpu().numpy() 684 | 685 | inputs_3d = torch.from_numpy(batch.astype('float32')) 686 | if torch.cuda.is_available(): 687 | inputs_3d = inputs_3d.cuda() 688 | inputs_3d[:, :, 0] = 0 689 | if test_generator.augment_enabled(): 690 | inputs_3d = inputs_3d[:1] 691 | 692 | error = mpjpe(predicted_3d_pos, inputs_3d) 693 | epoch_loss_3d_pos_scale += inputs_3d.shape[0]*inputs_3d.shape[1] * n_mpjpe(predicted_3d_pos, inputs_3d).item() 694 | 695 | epoch_loss_3d_pos += inputs_3d.shape[0]*inputs_3d.shape[1] * error.item() 696 | N += inputs_3d.shape[0] * inputs_3d.shape[1] 697 | 698 | inputs = inputs_3d.cpu().numpy().reshape(-1, inputs_3d.shape[-2], inputs_3d.shape[-1]) 699 | predicted_3d_pos = predicted_3d_pos.cpu().numpy().reshape(-1, inputs_3d.shape[-2], inputs_3d.shape[-1]) 700 | 701 | epoch_loss_3d_pos_procrustes += inputs_3d.shape[0]*inputs_3d.shape[1] * p_mpjpe(predicted_3d_pos, inputs) 702 | 703 | # Compute velocity error 704 | epoch_loss_3d_vel += inputs_3d.shape[0]*inputs_3d.shape[1] * mean_velocity_error(predicted_3d_pos, inputs) 705 | 706 | if action is None: 707 | print('----------') 708 | else: 709 | print('----'+action+'----') 710 | e1 = (epoch_loss_3d_pos / N)*1000 711 | e2 = (epoch_loss_3d_pos_procrustes / N)*1000 712 | e3 = 
(epoch_loss_3d_pos_scale / N)*1000 713 | ev = (epoch_loss_3d_vel / N)*1000 714 | print('Test time augmentation:', test_generator.augment_enabled()) 715 | print('Protocol #1 Error (MPJPE):', e1, 'mm') 716 | print('Protocol #2 Error (P-MPJPE):', e2, 'mm') 717 | print('Protocol #3 Error (N-MPJPE):', e3, 'mm') 718 | print('Velocity Error (MPJVE):', ev, 'mm') 719 | print('----------') 720 | 721 | return e1, e2, e3, ev 722 | 723 | 724 | if args.render: 725 | print('Rendering...') 726 | 727 | input_keypoints = keypoints[args.viz_subject][args.viz_action][args.viz_camera].copy() 728 | ground_truth = None 729 | if args.viz_subject in dataset.subjects() and args.viz_action in dataset[args.viz_subject]: 730 | if 'positions_3d' in dataset[args.viz_subject][args.viz_action]: 731 | ground_truth = dataset[args.viz_subject][args.viz_action]['positions_3d'][args.viz_camera].copy() 732 | if ground_truth is None: 733 | print('INFO: this action is unlabeled. Ground truth will not be rendered.') 734 | 735 | gen = UnchunkedGenerator(None, None, [input_keypoints], 736 | pad=pad, causal_shift=causal_shift, augment=args.test_time_augmentation, 737 | kps_left=kps_left, kps_right=kps_right, joints_left=joints_left, joints_right=joints_right) 738 | prediction = evaluate(gen, return_predictions=True) 739 | if model_traj is not None and ground_truth is None: 740 | prediction_traj = evaluate(gen, return_predictions=True, use_trajectory_model=True) 741 | prediction += prediction_traj 742 | 743 | if args.viz_export is not None: 744 | print('Exporting joint positions to', args.viz_export) 745 | # Predictions are in camera space 746 | np.save(args.viz_export, prediction) 747 | 748 | if args.viz_output is not None: 749 | if ground_truth is not None: 750 | # Reapply trajectory 751 | trajectory = ground_truth[:, :1] 752 | ground_truth[:, 1:] += trajectory 753 | prediction += trajectory 754 | 755 | # Invert camera transformation 756 | cam = dataset.cameras()[args.viz_subject][args.viz_camera] 757 | if ground_truth is not None: 758 | prediction = camera_to_world(prediction, R=cam['orientation'], t=cam['translation']) 759 | ground_truth = camera_to_world(ground_truth, R=cam['orientation'], t=cam['translation']) 760 | else: 761 | # If the ground truth is not available, take the camera extrinsic params from a random subject. 762 | # They are almost the same, and anyway, we only need this for visualization purposes. 
763 | for subject in dataset.cameras(): 764 | if 'orientation' in dataset.cameras()[subject][args.viz_camera]: 765 | rot = dataset.cameras()[subject][args.viz_camera]['orientation'] 766 | break 767 | prediction = camera_to_world(prediction, R=rot, t=0) 768 | # We don't have the trajectory, but at least we can rebase the height 769 | prediction[:, :, 2] -= np.min(prediction[:, :, 2]) 770 | 771 | anim_output = {'Reconstruction': prediction} 772 | if ground_truth is not None and not args.viz_no_ground_truth: 773 | anim_output['Ground truth'] = ground_truth 774 | 775 | input_keypoints = image_coordinates(input_keypoints[..., :2], w=cam['res_w'], h=cam['res_h']) 776 | 777 | from common.visualization import render_animation 778 | render_animation(input_keypoints, keypoints_metadata, anim_output, 779 | dataset.skeleton(), dataset.fps(), args.viz_bitrate, cam['azimuth'], args.viz_output, 780 | limit=args.viz_limit, downsample=args.viz_downsample, size=args.viz_size, 781 | input_video_path=args.viz_video, viewport=(cam['res_w'], cam['res_h']), 782 | input_video_skip=args.viz_skip) 783 | 784 | else: 785 | print('Evaluating...') 786 | all_actions = {} 787 | all_actions_by_subject = {} 788 | for subject in subjects_test: 789 | if subject not in all_actions_by_subject: 790 | all_actions_by_subject[subject] = {} 791 | 792 | for action in dataset[subject].keys(): 793 | action_name = action.split(' ')[0] 794 | if action_name not in all_actions: 795 | all_actions[action_name] = [] 796 | if action_name not in all_actions_by_subject[subject]: 797 | all_actions_by_subject[subject][action_name] = [] 798 | all_actions[action_name].append((subject, action)) 799 | all_actions_by_subject[subject][action_name].append((subject, action)) 800 | 801 | def fetch_actions(actions): 802 | out_poses_3d = [] 803 | out_poses_2d = [] 804 | 805 | for subject, action in actions: 806 | poses_2d = keypoints[subject][action] 807 | for i in range(len(poses_2d)): # Iterate across cameras 808 | out_poses_2d.append(poses_2d[i]) 809 | 810 | poses_3d = dataset[subject][action]['positions_3d'] 811 | assert len(poses_3d) == len(poses_2d), 'Camera count mismatch' 812 | for i in range(len(poses_3d)): # Iterate across cameras 813 | out_poses_3d.append(poses_3d[i]) 814 | 815 | stride = args.downsample 816 | if stride > 1: 817 | # Downsample as requested 818 | for i in range(len(out_poses_2d)): 819 | out_poses_2d[i] = out_poses_2d[i][::stride] 820 | if out_poses_3d is not None: 821 | out_poses_3d[i] = out_poses_3d[i][::stride] 822 | 823 | return out_poses_3d, out_poses_2d 824 | 825 | def run_evaluation(actions, action_filter=None): 826 | errors_p1 = [] 827 | errors_p2 = [] 828 | errors_p3 = [] 829 | errors_vel = [] 830 | 831 | for action_key in actions.keys(): 832 | if action_filter is not None: 833 | found = False 834 | for a in action_filter: 835 | if action_key.startswith(a): 836 | found = True 837 | break 838 | if not found: 839 | continue 840 | 841 | poses_act, poses_2d_act = fetch_actions(actions[action_key]) 842 | gen = UnchunkedGenerator(None, poses_act, poses_2d_act, 843 | pad=pad, causal_shift=causal_shift, augment=args.test_time_augmentation, 844 | kps_left=kps_left, kps_right=kps_right, joints_left=joints_left, joints_right=joints_right) 845 | e1, e2, e3, ev = evaluate(gen, action_key) 846 | errors_p1.append(e1) 847 | errors_p2.append(e2) 848 | errors_p3.append(e3) 849 | errors_vel.append(ev) 850 | 851 | print('Protocol #1 (MPJPE) action-wise average:', round(np.mean(errors_p1), 1), 'mm') 852 | print('Protocol #2 (P-MPJPE) 
action-wise average:', round(np.mean(errors_p2), 1), 'mm') 853 | print('Protocol #3 (N-MPJPE) action-wise average:', round(np.mean(errors_p3), 1), 'mm') 854 | print('Velocity (MPJVE) action-wise average:', round(np.mean(errors_vel), 2), 'mm') 855 | 856 | if not args.by_subject: 857 | run_evaluation(all_actions, action_filter) 858 | else: 859 | for subject in all_actions_by_subject.keys(): 860 | print('Evaluating on subject', subject) 861 | run_evaluation(all_actions_by_subject[subject], action_filter) 862 | print('') --------------------------------------------------------------------------------
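The Detectron2 inference script (`inference/infer_video_d2.py`) stores its per-frame detections in a Detectron1-compatible layout (`[[], bbox_tensor]` and `[[], kps]`, with the keypoint array transposed to shape `(num_people, 4, num_joints)`), together with the source video resolution, in a compressed `.npz` archive named after the input video. Below is a minimal sketch of how such an archive might be inspected; the file path is a hypothetical example, and the per-person joint count (17 COCO keypoints for the default Detectron2 keypoint models) is an assumption rather than something fixed by the script.

```python
import numpy as np

# Minimal sketch (not part of the repository): inspect the archive written by
# inference/infer_video_d2.py. The path is hypothetical; np.savez_compressed
# appends '.npz' to the basename of the processed video.
data = np.load('/tmp/infer_simple/video.mp4.npz', allow_pickle=True)

metadata = data['metadata'].item()   # {'w': frame width, 'h': frame height}
boxes = data['boxes']                # one [[], bbox_array] entry per frame; bbox rows are (x1, y1, x2, y2, score)
keypoints = data['keypoints']        # one [[], kps] entry per frame; kps has shape (num_people, 4, num_joints)

print('Resolution: {}x{}, frames: {}'.format(metadata['w'], metadata['h'], len(keypoints)))

kps = keypoints[0][1]                # index 1 mirrors the Detectron1 class layout used by the script
if len(kps) > 0:
    # For each detected person, the four rows are x, y, a dummy logit, and the keypoint probability.
    xy = kps[0, :2].T                # (num_joints, 2) image coordinates for the first detection
    prob = kps[0, 3]                 # (num_joints,) per-keypoint confidence
    print('People in first frame: {}, joints per person: {}'.format(len(kps), xy.shape[0]))
```

The zero "logit" row only mimics the Detectron1 keypoint layout (as noted in the script's own comment), so the probability row can be read at index 3 regardless of which detector produced the file.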