Review of the paper "Estimating 3D Motion and Forces of Person-Object Interactions from Monocular Video" MVA Robotics class 2023


Monocular joint pose and forces estimation

Context

Review of the CVPR 2019 paper Estimating 3D Motion and Forces of Person-Object Interactions from Monocular Video

📜 Poster

🧪 Demo

overview

Summary

Original paper's approach

Author's code Original Paper
The original paper estimates 3D poses and the internal torques / external forces applied to a human interacting with a tool. The goal behind this paper is to collect the dynamics of human manual activities from standard videos (including instructional tutorials) to later enable behavior cloning on real robots.
Pipeline overview Inverse dynamics
Multi-stage vision leads to human (2D & 3D) and object pose estimates as well as contact predictions. $q$ denotes configuration states, $\tau$ muscle torques, $F$ external forces applied by the object (or ground) on the human, and $\kappa$ contact-related constraints.

Our work


💪 Simplified setup

For simplicity, we studied a much simpler problem, making the following assumptions:

  • 🔨❌ no external objects or tools
  • 💪 a single arm, not the full body
  • 📷 the camera is assumed calibrated and standing on a tripod

📷 Video pipeline and inverse kinematics

The current code is able to batch process videos and perform:

  • 2D and 3D pose estimation using Mediapipe (off the shelf)
    • RGB video ➡️ 2D & 3D joint positions
  • inverse kinematics from the 3D points
    • 3D points ➡️ arm configuration states (the so-called $q$)
  • camera pose fitting to minimize the 2D reprojection error
    • 2D points & 3D points from forward kinematics ➡️ sequence of extrinsic matrices (just a 3D translation)
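
The inverse kinematics step can be sketched as a small least-squares problem. Below is a minimal planar 2-link sketch (the actual code uses a full 3D arm model); the limb lengths and function names are illustrative, not the repository's API.

```python
import numpy as np
from scipy.optimize import least_squares

# Assumed limb lengths in meters (illustrative values).
L_UPPER, L_FORE = 0.30, 0.25

def forward_kinematics(q):
    """Planar 2-link arm: q = (shoulder, elbow) angles -> elbow & wrist positions."""
    s, e = q
    elbow = L_UPPER * np.array([np.cos(s), np.sin(s)])
    wrist = elbow + L_FORE * np.array([np.cos(s + e), np.sin(s + e)])
    return elbow, wrist

def inverse_kinematics(elbow_obs, wrist_obs, q_init=(0.1, 0.1)):
    """Recover the configuration q that best explains the observed joint positions."""
    def residual(q):
        elbow, wrist = forward_kinematics(q)
        return np.concatenate([elbow - elbow_obs, wrist - wrist_obs])
    return least_squares(residual, q_init).x
```

In the video pipeline, the previous frame's solution plays the role of `q_init`, which keeps the frame-by-frame estimation consistent.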

🧪 Dynamics: simulations

We mostly validated the inverse dynamics optimizer on simulations.

Free fall / Free fall + friction
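
As a rough illustration of the kind of sanity check involved, here is a minimal point-mass simulation of free fall with linear drag (explicit Euler); the mass and friction coefficient are made-up values, and the real validation uses the arm's rigid-body dynamics.

```python
import numpy as np

def simulate_fall(m=0.2, k=0.05, dt=1e-3, steps=2000, g=9.81):
    """Explicit Euler integration of m*dv/dt = -m*g - k*v (vertical axis only)."""
    y, v = 0.0, 0.0
    heights = []
    for _ in range(steps):
        a = -g - (k / m) * v   # gravity plus linear drag
        v += a * dt
        y += v * dt
        heights.append(y)
    return np.array(heights)
```

With `k=0` this reduces to pure free fall ($y = -\frac{1}{2}gt^2$); with drag, the mass falls less far over the same 2 seconds.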

Known limitations

There are still several missing pieces:

  • We do not minimize the 2D reprojection error while performing inverse kinematics.
  • Several attempts were made to reproduce the dynamics optimizer and recover the shoulder and elbow torques. This is part of the extra notebook, but so far clean torques cannot be retrieved properly from videos.

Setup

Setup projectyl

git clone [email protected]:balthazarneveu/monocular_pose_and_forces_estimation.git
cd monocular_pose_and_forces_estimation
pip install -e .

Download vision models

wget -O pose_landmarker.task -q https://storage.googleapis.com/mediapipe-models/pose_landmarker/pose_landmarker_heavy/float16/1/pose_landmarker_heavy.task

Processing videos

python scripts/batch_video_processing.py -i "data/*.mp4" -o __out -A demo
  • -i: regex or list of videos
  • -o: output folder (automatically created if not present)
  • --resize: resize ratio
  • -A: pick an algorithm
    • -A view simply launches a frame-by-frame viewer with zoom-in capabilities
    • -A pose runs the Mediapipe pose estimator (runs once, stores the results, reloads them next time)
    • -A ik runs inverse kinematics frame by frame (keeping the previous frame's result as an initialization)
    • -A fit_cam performs camera pose estimation
  • -smoothness adjusts the smoothness used when fitting the camera pose
  • -calib points to the geometric calibration file, in case you're using a different camera

--override can be used to recompute images and overwrite previous results.

Demo


Modelization

Pose estimation Arm model
🗽 pose estimated from video 🔧 Synchronized "Digital twin"

Color code:

  • joints:
    • 🔴 shoulder
    • 🟢 elbow
    • 🔵 wrist
  • limbs:
    • 🟩 upper-arm
    • 🟦 forearm

Pose estimation

Using Google Mediapipe, we're able to retrieve:

  • 2D projected points
  • 3D coarse estimation

Inverse kinematics: coarse initialization of the dynamic system

Based on a frame-by-frame estimation (initialized from the previous frame's estimation), we're able to retrieve a decent arm state.

⚠️ As Mediapipe does not provide 3D coordinates that guarantee fixed arm lengths, the system can't always be resolved correctly.

💡 One idea is to initialize the inverse kinematics from 3D points rescaled to standardized arm lengths.
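
A minimal sketch of that idea, rescaling the raw Mediapipe 3D joints so each limb has a fixed, standardized length (the lengths and names below are illustrative, not the repository's):

```python
import numpy as np

# Assumed standardized limb lengths in meters (illustrative values).
UPPER_ARM, FOREARM = 0.30, 0.25

def standardize_arm(shoulder, elbow, wrist):
    """Keep limb directions from the raw 3D estimate, but fix limb lengths."""
    u = elbow - shoulder
    elbow_std = shoulder + UPPER_ARM * u / np.linalg.norm(u)
    v = wrist - elbow
    wrist_std = elbow_std + FOREARM * v / np.linalg.norm(v)
    return shoulder, elbow_std, wrist_std
```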

Projective camera geometry

  • Pinhole model is used
Camera World
OpenCV convention Pinocchio convention
  • test_camera_projection.py has a numerical example showing how to project a 3D point expressed in the Pinocchio convention onto the sensor.
    • The top of the Eiffel Tower is located $Z=324m$ above the ground ($Z$ in world coordinates)
    • at a distance $Y=6km$ from the camera ($Y$ in world coordinates)
    • The camera is a Canon 5D Mark III, 24 Mpix ($h=4000$, $w=6000$)
    • Its full-frame sensor (24mm x 36mm) has a $\mu=6\text{µm}$ pixel pitch.
    • $f_{\text{pix}} = \frac{f}{\mu}$, the focal length expressed in pixels
    • $y - \frac{h}{2} = -\frac{Z \cdot f_{\text{pix}}}{Y} \approx \frac{-324 \times 8333}{6000} \approx -450 \text{ pixels}$
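
The numbers above can be checked in a few lines (assuming a 50 mm lens, which is what $f_{\text{pix}} \approx 8333$ implies for a 6 µm pixel pitch):

```python
f = 0.050          # focal length in meters (assumed: 50 mm lens)
mu = 6e-6          # pixel pitch in meters
f_pix = f / mu     # focal length in pixels, ~8333
h, w = 4000, 6000  # sensor resolution

Z, Y = 324.0, 6000.0       # tower-top height and camera distance (meters)
y_offset = -Z * f_pix / Y  # vertical offset from the principal point
print(round(y_offset))     # prints -450 (pixels above the image center)
```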

Projecting the arm into the camera plane

Camera World

Camera calibration

From a video, shoot a 7x10 checkerboard in multiple orientations, then use Zhang's method.

python scripts/batch_video_processing.py -i "data/*cam*calib*.mp4" -o calibration -A camera_calibration

camera_calibration


Camera pose optimization

A least-squares optimizer is used to estimate the camera translation with respect to the shoulder. We enforce smoothness of the camera/shoulder translation over time.

$\min_{T_{\text{camera}}} \left\| K \, [Q_{\text{camera}} | T_{\text{camera}}] \, \vec{P^{\text{3D}}} - \vec{p^{\text{2D}}} \right\|^2 + \left\| \frac{\Delta T_{\text{camera}}}{\Delta t} \right\|^2$

In our simplified model, since the shoulder can fully rotate, we freeze the camera rotation $Q_{\text{camera}}=I_{3}$ and only optimize over the translation.

The average 2D reprojection error over the whole sequence is on the order of 20 pixels on a Full HD video. Such an error is not totally unexpected:

  • When we standardize the arm length, we introduce an extra reprojection error in the 3D points.
  • The arm state is then estimated to fit these 3D points using inverse kinematics alone; the current IK solution does not minimize the 2D reprojection error.
  • To keep the demo fast, we compute the optimizer solution on sliding windows of 30 frames. Optimizing over the whole sequence takes quite a while (roughly 1 minute for a 1-minute video).
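
A minimal sketch of this optimization: rotation frozen to identity, one translation per frame, and a smoothness term between consecutive frames. Shapes and names are illustrative, not the repository's API.

```python
import numpy as np
from scipy.optimize import least_squares

def fit_camera_translations(K, P3d, p2d, lam=0.1):
    """P3d: (T, N, 3) joint positions, p2d: (T, N, 2) observed pixels.
    Solves for one camera translation per frame, with rotation R = I."""
    n_frames = P3d.shape[0]

    def project(points, t):
        cam = points + t                  # R = I, so X_cam = X_world + t
        uv = (K @ cam.T).T
        return uv[:, :2] / uv[:, 2:3]     # perspective division

    def residual(x):
        ts = x.reshape(n_frames, 3)
        reproj = [(project(P3d[i], ts[i]) - p2d[i]).ravel()
                  for i in range(n_frames)]
        smooth = [lam * (ts[i + 1] - ts[i]) for i in range(n_frames - 1)]
        return np.concatenate(reproj + smooth)

    x0 = np.tile([0.0, 0.0, 2.0], n_frames)  # start 2 m in front of the scene
    return least_squares(residual, x0).x.reshape(n_frames, 3)
```

In practice the smoothness weight trades reprojection accuracy against jitter in the recovered camera trajectory.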


🎁 Extra

Demo Video

In case you'd like to dig into the original authors' implementation, we provide a functional fork of PyTorch OpenPose which allows batch processing of videos.

Initially, we planned to measure ground-truth velocity by retrieving the trajectory of a ball thrown by the hand.

  • We ran a few experiments with SAM (Segment Anything), which is wrapped in the library, to select the ball.
  • We also have an ultra-basic CERES C++ optimizer to fit the parabola of a ball in free fall.
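
The parabola fit itself is just linear least squares; here is a Python equivalent of what the CERES optimizer does, a sketch assuming 1D vertical positions $y(t) = y_0 + v_0 t - \frac{1}{2} g t^2$:

```python
import numpy as np

def fit_parabola(t, y):
    """Linear least-squares fit of y(t) = y0 + v0*t - 0.5*g*t^2."""
    A = np.stack([np.ones_like(t), t, -0.5 * t**2], axis=1)
    y0, v0, g = np.linalg.lstsq(A, y, rcond=None)[0]
    return y0, v0, g
```

Comparing the recovered $g$ against its known value (9.81 m/s²) gives a sanity check on the trajectory's metric scale.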

CERES setup

git clone https://ceres-solver.googlesource.com/ceres-solver
cd ceres-solver
mkdir ceres-bin
cd ceres-bin
cmake ../../ceres-solver -DCMAKE_CUDA_ARCHITECTURES=native -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
make -j3
make install
