Review of the CVPR 2019 paper *Estimating 3D Motion and Forces of Person-Object Interactions from Monocular Video*
- Program: MVA Master's degree class on Robotics. ENS Paris-Saclay.
- Authors:
- 📄 Technical report
- 📄 Poster
| Author's code | Original Paper |
|---|---|

The original paper estimates 3D poses and the internal torques / external forces applied to a human interacting with a tool. The goal behind this paper is to collect the dynamics of human manual activities from standard videos (including educational tutorials) to later enable behavior cloning on real robots.
For simplicity, we studied a much simpler problem, making the following assumptions:
- 🔨❌ no external objects or tools
- 💪 a single arm, not the full body
- 📷 the camera is assumed calibrated and mounted on a tripod
The current code is able to batch-process videos and perform:
- 2D and 3D pose estimation using MediaPipe (off the shelf)
  - RGB video ➡️ 2D & 3D joint positions
- inverse kinematics from the 3D points
  - 3D points ➡️ arm configuration states (the so-called $q$)
- camera pose fitting, in order to minimize the 2D reprojection error
  - 2D points & 3D points from forward kinematics ➡️ sequence of extrinsic matrices (just a 3D translation)
We mostly validated the inverse dynamics optimizer on simulations.
| Free fall | Free fall + friction |
|---|---|
There are still several missing pieces:
- We do not minimize the 2D reprojection error while performing inverse kinematics.
- Many attempts have been made to reproduce the dynamics optimizer and recover the torques (shoulder and elbow); see the sketch after this list. This is part of the extra notebook, but so far clean torques cannot be retrieved properly from videos.
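A minimal sketch of that torque-recovery step, assuming a Pinocchio arm model (`buildSampleModelManipulator` is a stand-in for our shoulder+elbow model, and the trajectory is a placeholder):

```python
import numpy as np
import pinocchio as pin

model = pin.buildSampleModelManipulator()  # stand-in for the 2-joint arm model
data = model.createData()

dt = 1.0 / 30.0                      # video frame period
q_traj = np.zeros((100, model.nq))   # placeholder: joint angles from IK

# Finite-difference velocities and accelerations along the trajectory
v_traj = np.gradient(q_traj, dt, axis=0)
a_traj = np.gradient(v_traj, dt, axis=0)

# RNEA (inverse dynamics) returns the torques realizing (q, v, a) under gravity
torques = np.stack([pin.rnea(model, data, q, v, a)
                    for q, v, a in zip(q_traj, v_traj, a_traj)])
```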
```bash
git clone git@github.com:balthazarneveu/monocular_pose_and_forces_estimation.git
cd monocular_pose_and_forces_estimation
pip install -e .
```
Download the vision model:
```bash
wget -O pose_landmarker.task -q https://storage.googleapis.com/mediapipe-models/pose_landmarker/pose_landmarker_heavy/float16/1/pose_landmarker_heavy.task
```
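For reference, a minimal sketch of how this model file is typically loaded with the MediaPipe Tasks API (mediapipe >= 0.10); this is not necessarily the repo's exact code:

```python
import mediapipe as mp
from mediapipe.tasks import python as mp_tasks
from mediapipe.tasks.python import vision

# Load the downloaded pose_landmarker.task model in video mode
options = vision.PoseLandmarkerOptions(
    base_options=mp_tasks.BaseOptions(model_asset_path="pose_landmarker.task"),
    running_mode=vision.RunningMode.VIDEO)
landmarker = vision.PoseLandmarker.create_from_options(options)

# Per frame (frame: HxWx3 uint8 RGB array, t_ms: integer timestamp in ms):
# mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=frame)
# result = landmarker.detect_for_video(mp_image, t_ms)
# result.pose_landmarks -> 2D normalized points, result.pose_world_landmarks -> 3D
```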
```bash
python scripts/batch_video_processing.py -i "data/*.mp4" -o __out -A demo
```
- `-i`: regex or list of videos
- `-o`: output folder (automatically created if not present)
- `--resize`: resize ratio
- `-A`: picks an algorithm (a combined example invocation is shown after this list):
  - `-A view` will simply launch a frame-by-frame viewer with zoom-in capabilities
  - `-A pose` will run the MediaPipe pose estimator (run once, store the results, reload them next time)
  - `-A ik` will run inverse kinematics frame by frame (keeping the previous frame as an initialization)
  - `-A fit_cam` will perform camera pose estimation
- `-smoothness` allows modifying the smoothness used when fitting the camera pose
- `-calib` points to the geometric calibration file, in case you're using a different camera
- `--override` can be used to recompute images and overwrite previous results
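For instance, combining the flags above (video names and values are placeholders):

```bash
python scripts/batch_video_processing.py -i "data/*.mp4" -o __out -A fit_cam --resize 0.5 --override
```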
| Pose estimation | Arm model |
|---|---|
| 📽 pose estimated from video | 🚧 Synchronized "digital twin" |
Color code:
- joints:
  - 🔴 shoulder
  - 🟢 elbow
  - 🔵 wrist
- limbs:
  - 🟩 upper-arm
  - 🟦 forearm
Using Google MediaPipe, we're able to retrieve:
- 2D projected points
- a coarse 3D estimation
Based on a frame-by-frame estimation (initialized from the previous frame's estimation), we're able to retrieve a decent arm state, as sketched below.

💡 An idea is to force the initialization of the inverse kinematics by enforcing standardized arm lengths.
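A minimal sketch of that warm-started, frame-by-frame inverse kinematics, assuming a Pinocchio model (`buildSampleModelManipulator`, `JOINT_IDS`, and the input points are stand-ins for the real shoulder+elbow arm):

```python
import numpy as np
import pinocchio as pin
from scipy.optimize import least_squares

model = pin.buildSampleModelManipulator()  # stand-in for the arm model
data = model.createData()
JOINT_IDS = [4, 6]  # hypothetical joint indices matched to MediaPipe points

def residuals(q, targets_3d):
    """Distance between model joint positions and MediaPipe 3D points."""
    pin.forwardKinematics(model, data, q)
    pts = np.stack([data.oMi[j].translation for j in JOINT_IDS])
    return (pts - targets_3d).ravel()

# placeholder for per-frame MediaPipe 3D points, shape (n_joints, 3)
video_3d_points = [np.random.randn(len(JOINT_IDS), 3) for _ in range(5)]

q_prev = pin.neutral(model)
for targets_3d in video_3d_points:
    sol = least_squares(residuals, q_prev, args=(targets_3d,))
    q_prev = sol.x  # warm start for the next frame
```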
- A pinhole camera model is used.
| Camera | World |
|---|---|
| OpenCV convention | Pinocchio convention |
- `test_camera_projection.py` has a numerical example showing how to project a 3D point in the Pinocchio referential onto the sensor.
- The top of the Eiffel Tower is located $Z = 324\,\text{m}$ above the ground ($Z$ in world coordinates).
- It is at a distance $Y = 6\,\text{km}$ from the camera ($Y$ in world coordinates).
- The camera is a Canon 5D Mark III, 24 Mpix ($h = 4000$, $w = 6000$).
- The full-frame sensor (24mm x 36mm) has a $\mu = 6\,\text{µm}$ pixel pitch, so the focal length in pixels is $f_{\text{pix}} = \frac{f}{\mu}$ (focal length divided by pixel pitch).
- The top of the tower therefore projects at $y - \frac{h}{2} = \frac{-Z \cdot f_{\text{pix}}}{Y} \approx \frac{-324 \times 8333}{6000} \approx -450\ \text{pixels}$.
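A quick numeric check of this example (the 50 mm focal length is implied by $f_{\text{pix}} \approx 8333$):

```python
# Pinhole projection check for the Eiffel Tower example
f = 50e-3             # focal length: 50 mm lens
mu = 6e-6             # pixel pitch: 6 microns
f_pix = f / mu        # ~8333 pixels
Z, Y = 324.0, 6000.0  # height above ground and distance (meters)
y_offset = -Z * f_pix / Y
print(f"{f_pix:.0f} pixels, y offset {y_offset:.0f} pixels")  # 8333, -450
```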
| Camera | World |
|---|---|
From a video, shoot a 7x10 checkerboard in multiple orientations, then use Zhang's method:
```bash
python scripts/batch_video_processing.py -i "data/*cam*calib*.mp4" -o calibration -A camera_calibration
```
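A minimal sketch of the Zhang calibration behind this script, using standard OpenCV calls (the video path and the inner-corner interpretation of the pattern size are assumptions):

```python
import cv2
import numpy as np

PATTERN = (6, 9)  # inner corners of a checkerboard with 7x10 squares
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2)

obj_points, img_points, size = [], [], None
cap = cv2.VideoCapture("data/my_cam_calib.mp4")  # placeholder path
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    size = gray.shape[::-1]
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Zhang's method: recover intrinsics K and distortion coefficients
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, size, None, None)
```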
A least-squares optimizer is used to estimate the camera translation with respect to the shoulder; we enforce smoothness of this camera/shoulder translation over time. In our simplified model, since the shoulder can fully rotate, we freeze the camera rotation, as sketched below.
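A minimal sketch of this fit, assuming intrinsics `K`, per-frame 3D joints `pts3d` from forward kinematics, and 2D detections `pts2d` (all names and the smoothness weight are placeholders):

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(t_flat, K, pts3d, pts2d, smoothness):
    """2D reprojection error plus a smoothness penalty on translations."""
    t = t_flat.reshape(-1, 3)              # (T, 3) camera translations
    cam = pts3d + t[:, None, :]            # frozen rotation: translate only
    proj = cam @ K.T
    proj = proj[..., :2] / proj[..., 2:3]  # perspective division
    reproj = (proj - pts2d).ravel()
    smooth = smoothness * np.diff(t, axis=0).ravel()  # penalize jumps
    return np.concatenate([reproj, smooth])

# pts3d: (T, J, 3), pts2d: (T, J, 2), t0: (T, 3) initial translations
# sol = least_squares(residuals, t0.ravel(), args=(K, pts3d, pts2d, 10.0))
```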
The average 2D reprojection error over the whole sequence is on the order of 20 pixels on a Full HD video. Such an error is not totally unexpected:
- When we standardize the arm length, we introduce an extra reprojection error in the 3D points.
- The arm state has then been estimated to fit these 3D points simply using inverse kinematics; the current IK solution does not minimize the 2D reprojection error.
- To be fast for the demo, we compute the optimizer solution for each sliding window of 30 frames. The optimization over the whole sequence takes quite a while (roughly 1 minute for a 1-minute video).
In case you'd like to dig into the original authors' implementation, we provide a functional fork of PyTorch OpenPose which allows batch processing of videos.
Initially, we had plans to measure ground-truth velocity by retrieving the trajectory of a ball thrown by the hand.
- We did a few experiments with SAM (Segment Anything), which is wrapped in the library, to select the ball.
- We also have an ultra-basic C++ Ceres optimizer to fit the parabola of a ball in free fall.
```bash
git clone https://ceres-solver.googlesource.com/ceres-solver
cd ceres-solver
mkdir ceres-bin
cd ceres-bin
cmake ../../ceres-solver -DCMAKE_CUDA_ARCHITECTURES=native -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
make -j3
make install
```
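The same parabola fit can be cross-checked in a few lines of NumPy (synthetic data below): a ball in free fall follows $y(t) = y_0 + v_0 t - \frac{1}{2} g t^2$, a degree-2 polynomial in $t$.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 30)            # frame timestamps (s)
y = 2.0 + 3.0 * t - 0.5 * 9.81 * t**2    # synthetic ball height (m)
coeffs = np.polyfit(t, y, deg=2)         # least-squares parabola fit
g_est = -2.0 * coeffs[0]                 # recover gravity, ~9.81 m/s^2
print(coeffs, g_est)
```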