Review of the CVPR 2019 paper *Estimating 3D Motion and Forces of Person-Object Interactions from Monocular Video*
- Program: MVA Master's degree class on Robotics. ENS Paris-Saclay.
- Authors:
- 📄 Technical report
- 📄 Poster
| Author's code | Original Paper |
|---|---|

The original paper estimates 3D poses and the internal torques / external forces applied to a human interacting with a tool. The goal behind this paper is to collect the dynamics of human manual activities from standard videos (including educational tutorials) to later enable behavior cloning on real robots.
For simplicity, we studied a much simpler problem, making the following assumptions:
- 🔨❌ no external objects or tools
- 💪 a single arm, not the full body
- 📷 the camera is assumed calibrated and mounted on a tripod
The current code is able to batch-process videos and perform:
- 2D and 3D pose estimation using MediaPipe (off the shelf)
  - RGB video ➡️ 2D & 3D joint positions
- inverse kinematics from the 3D points
  - 3D points ➡️ arm configuration states (the so-called $q$)
- camera pose fitting, in order to minimize the 2D reprojection error
  - 2D points & 3D points from forward kinematics ➡️ sequence of extrinsic matrices (just a 3D translation)
We mostly validated the inverse dynamics optimizer on simulations.
| Free fall | Free fall + friction |
|---|---|
There are still several missing pieces:
- We do not minimize the 2D reprojection error while performing inverse kinematics.
- Many attempts have been made to reproduce the dynamics optimizer and recover the torques (shoulder and elbow); see the sketch after this list. This is part of the extra notebook, but so far clean torques cannot be retrieved properly from videos.
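A minimal sketch of that torque-recovery step, assuming a Pinocchio arm model (`buildSampleModelManipulator` is a stand-in for our shoulder+elbow model, and the trajectory is a placeholder):

```python
import numpy as np
import pinocchio as pin

model = pin.buildSampleModelManipulator()  # stand-in for the 2-joint arm model
data = model.createData()

dt = 1.0 / 30.0                      # video frame period
q_traj = np.zeros((100, model.nq))   # placeholder: joint angles from IK

# Finite-difference velocities and accelerations along the trajectory
v_traj = np.gradient(q_traj, dt, axis=0)
a_traj = np.gradient(v_traj, dt, axis=0)

# RNEA (inverse dynamics) returns the torques realizing (q, v, a) under gravity
torques = np.stack([pin.rnea(model, data, q, v, a)
                    for q, v, a in zip(q_traj, v_traj, a_traj)])
```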
```bash
git clone git@github.com:balthazarneveu/monocular_pose_and_forces_estimation.git
cd monocular_pose_and_forces_estimation
pip install -e .
```
Download the vision model:
```bash
wget -O pose_landmarker.task -q https://storage.googleapis.com/mediapipe-models/pose_landmarker/pose_landmarker_heavy/float16/1/pose_landmarker_heavy.task
```
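For reference, a minimal sketch of how this model file is typically loaded with the MediaPipe Tasks API (mediapipe >= 0.10); this is not necessarily the repo's exact code:

```python
import mediapipe as mp
from mediapipe.tasks import python as mp_tasks
from mediapipe.tasks.python import vision

# Load the downloaded pose_landmarker.task model in video mode
options = vision.PoseLandmarkerOptions(
    base_options=mp_tasks.BaseOptions(model_asset_path="pose_landmarker.task"),
    running_mode=vision.RunningMode.VIDEO)
landmarker = vision.PoseLandmarker.create_from_options(options)

# Per frame (frame: HxWx3 uint8 RGB array, t_ms: integer timestamp in ms):
# mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=frame)
# result = landmarker.detect_for_video(mp_image, t_ms)
# result.pose_landmarks -> 2D normalized points, result.pose_world_landmarks -> 3D
```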
```bash
python scripts/batch_video_processing.py -i "data/*.mp4" -o __out -A demo
```
- `-i`: regex or list of videos
- `-o`: output folder (automatically created if not present)
- `--resize`: resize ratio
- `-A`: picks an algorithm (a combined example invocation is shown after this list):
  - `-A view` will simply launch a frame-by-frame viewer with zoom-in capabilities
  - `-A pose` will run the MediaPipe pose estimator (run once, store the results, reload them next time)
  - `-A ik` will run inverse kinematics frame by frame (keeping the previous frame as an initialization)
  - `-A fit_cam` will perform camera pose estimation
- `-smoothness` allows modifying the smoothness used when fitting the camera pose
- `-calib` points to the geometric calibration file, in case you're using a different camera
- `--override` can be used to recompute images and overwrite previous results
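For instance, combining the flags above (video names and values are placeholders):

```bash
python scripts/batch_video_processing.py -i "data/*.mp4" -o __out -A fit_cam --resize 0.5 --override
```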
| Pose estimation | Arm model |
|---|---|
| 📽 pose estimated from video | 🚧 Synchronized "digital twin" |
Color code:
- joints:
  - 🔴 shoulder
  - 🟢 elbow
  - 🔵 wrist
- limbs:
  - 🟩 upper-arm
  - 🟦 forearm
Using Google MediaPipe, we're able to retrieve:
- 2D projected points
- a coarse 3D estimation
Based on a frame-by-frame estimation (initialized from the previous frame's estimation), we're able to retrieve a decent arm state, as sketched below.

💡 An idea is to force the initialization of the inverse kinematics by enforcing standardized arm lengths.
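A minimal sketch of that warm-started, frame-by-frame inverse kinematics, assuming a Pinocchio model (`buildSampleModelManipulator`, `JOINT_IDS`, and the input points are stand-ins for the real shoulder+elbow arm):

```python
import numpy as np
import pinocchio as pin
from scipy.optimize import least_squares

model = pin.buildSampleModelManipulator()  # stand-in for the arm model
data = model.createData()
JOINT_IDS = [4, 6]  # hypothetical joint indices matched to MediaPipe points

def residuals(q, targets_3d):
    """Distance between model joint positions and MediaPipe 3D points."""
    pin.forwardKinematics(model, data, q)
    pts = np.stack([data.oMi[j].translation for j in JOINT_IDS])
    return (pts - targets_3d).ravel()

# placeholder for per-frame MediaPipe 3D points, shape (n_joints, 3)
video_3d_points = [np.random.randn(len(JOINT_IDS), 3) for _ in range(5)]

q_prev = pin.neutral(model)
for targets_3d in video_3d_points:
    sol = least_squares(residuals, q_prev, args=(targets_3d,))
    q_prev = sol.x  # warm start for the next frame
```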
- A pinhole camera model is used.
| Camera | World |
|---|---|
| OpenCV convention | Pinocchio convention |
- `test_camera_projection.py` has a numerical example showing how to project a 3D point in the Pinocchio referential onto the sensor.
- The top of the Eiffel Tower is located $Z = 324\,\text{m}$ above the ground ($Z$ in world coordinates).
- It is at a distance $Y = 6\,\text{km}$ from the camera ($Y$ in world coordinates).
- The camera is a Canon 5D Mark III, 24 Mpix ($h = 4000$, $w = 6000$).
- The full-frame sensor (24mm x 36mm) has a $\mu = 6\,\text{µm}$ pixel pitch, so the focal length in pixels is $f_{\text{pix}} = \frac{f}{\mu}$ (focal length divided by pixel pitch).
- The top of the tower therefore projects at $y - \frac{h}{2} = \frac{-Z \cdot f_{\text{pix}}}{Y} \approx \frac{-324 \times 8333}{6000} \approx -450\ \text{pixels}$.
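A quick numeric check of this example (the 50 mm focal length is implied by $f_{\text{pix}} \approx 8333$):

```python
# Pinhole projection check for the Eiffel Tower example
f = 50e-3             # focal length: 50 mm lens
mu = 6e-6             # pixel pitch: 6 microns
f_pix = f / mu        # ~8333 pixels
Z, Y = 324.0, 6000.0  # height above ground and distance (meters)
y_offset = -Z * f_pix / Y
print(f"{f_pix:.0f} pixels, y offset {y_offset:.0f} pixels")  # 8333, -450
```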
| Camera | World |
|---|---|
From a video, shoot a 7x10 checkerboard in multiple orientations, then use Zhang's method:
```bash
python scripts/batch_video_processing.py -i "data/*cam*calib*.mp4" -o calibration -A camera_calibration
```
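A minimal sketch of the Zhang calibration behind this script, using standard OpenCV calls (the video path and the inner-corner interpretation of the pattern size are assumptions):

```python
import cv2
import numpy as np

PATTERN = (6, 9)  # inner corners of a checkerboard with 7x10 squares
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2)

obj_points, img_points, size = [], [], None
cap = cv2.VideoCapture("data/my_cam_calib.mp4")  # placeholder path
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    size = gray.shape[::-1]
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Zhang's method: recover intrinsics K and distortion coefficients
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, size, None, None)
```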
A least-squares optimizer is used to estimate the camera translation with respect to the shoulder; we enforce smoothness of this camera/shoulder translation over time. In our simplified model, since the shoulder can fully rotate, we freeze the camera rotation, as sketched below.
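A minimal sketch of this fit, assuming intrinsics `K`, per-frame 3D joints `pts3d` from forward kinematics, and 2D detections `pts2d` (all names and the smoothness weight are placeholders):

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(t_flat, K, pts3d, pts2d, smoothness):
    """2D reprojection error plus a smoothness penalty on translations."""
    t = t_flat.reshape(-1, 3)              # (T, 3) camera translations
    cam = pts3d + t[:, None, :]            # frozen rotation: translate only
    proj = cam @ K.T
    proj = proj[..., :2] / proj[..., 2:3]  # perspective division
    reproj = (proj - pts2d).ravel()
    smooth = smoothness * np.diff(t, axis=0).ravel()  # penalize jumps
    return np.concatenate([reproj, smooth])

# pts3d: (T, J, 3), pts2d: (T, J, 2), t0: (T, 3) initial translations
# sol = least_squares(residuals, t0.ravel(), args=(K, pts3d, pts2d, 10.0))
```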
The average 2D reprojection error over the whole sequence is on the order of 20 pixels on a Full HD video. Such an error is not totally unexpected:
- When we standardize the arm length, we introduce an extra reprojection error in the 3D points.
- The arm state has then been estimated to fit these 3D points simply using inverse kinematics; the current IK solution does not minimize the 2D reprojection error.
- To be fast for the demo, we compute the optimizer solution for each sliding window of 30 frames. The optimization over the whole sequence takes quite a while (roughly 1 minute for a 1-minute video).
In case you'd like to dig into the original authors' implementation, we provide a functional fork of PyTorch OpenPose which allows batch processing of videos.
Initially, we had plans to measure ground-truth velocity by retrieving the trajectory of a ball thrown by the hand.
- We did a few experiments with SAM (Segment Anything), which is wrapped in the library, to select the ball.
- We also have an ultra-basic C++ Ceres optimizer to fit the parabola of a ball in free fall.
```bash
git clone https://ceres-solver.googlesource.com/ceres-solver
cd ceres-solver
mkdir ceres-bin
cd ceres-bin
cmake ../../ceres-solver -DCMAKE_CUDA_ARCHITECTURES=native -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
make -j3
make install
```
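The same parabola fit can be cross-checked in a few lines of NumPy (synthetic data below): a ball in free fall follows $y(t) = y_0 + v_0 t - \frac{1}{2} g t^2$, a degree-2 polynomial in $t$.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 30)            # frame timestamps (s)
y = 2.0 + 3.0 * t - 0.5 * 9.81 * t**2    # synthetic ball height (m)
coeffs = np.polyfit(t, y, deg=2)         # least-squares parabola fit
g_est = -2.0 * coeffs[0]                 # recover gravity, ~9.81 m/s^2
print(coeffs, g_est)
```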