
A comprehensive framework to explore whether embodied multimodal models are plausibly resilient


amitkparekh/CoGeLoT


Amit Parekh, Nikolas Vitsakis, Alessandro Suglia, and Ioannis Konstas.


Table of perturbations from the paper

Unveiling the true robustness of multimodal models: A comprehensive framework to explore whether models are plausibly resilient.

Quick Start

Note

This codebase automatically downloads checkpoints and datasets, so you don't need to do that manually. Everything is hosted on Hugging Face and pulled through the Hugging Face Hub, so it's all cached locally too.

  1. Clone this repository and navigate to the folder

    git clone https://github.com/amitkparekh/CoGeLoT.git
    cd CoGeLoT
  2. Install the dependencies (I used PDM and Python 3.11)

    pdm install
  3. Make sure everything works and is installed correctly

    pdm run pytest --deselect tests/test_online_evaluation.py
  4. Train a model

    pdm run python src/cogelot/entrypoints/train.py --experiment=01_their_vima
  5. Evaluate a model from one of the provided checkpoints

    pdm run python src/cogelot/entrypoints/evaluate.py trainer.devices=1 model.model.wandb_run_id=8lkml12g

For model.model.wandb_run_id, you can use any of the Run IDs from the table below.
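For example, swapping in the Run ID of the paraphrase-trained cross-attention model (02_their_vima) from the table should evaluate that checkpoint instead:

pdm run python src/cogelot/entrypoints/evaluate.py trainer.devices=1 model.model.wandb_run_id=2df3mwfn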

Contents

Note

This project is codenamed cogelot, so that's what the library is called; keeping the codename meant not needing to rewrite everything.

What is included?

Everything. You should be able to run every single experiment from the paper. Datasets and models are hosted on HF.

While I tried to bring everything front and centre, some things might be buried. If you think something should be more prominent, feel free to open a PR and bring it forward! I'll definitely take opinions on this into consideration for future projects.

Additionally, I've tried to work in a constrained, clean, and robust manner. I hope that helps you as much as it helped me.

Model Architectures and Checkpoints

Below is a table of each model run and where to find the checkpoints. We're providing all checkpoints from the end of each epoch, even though we only used the one from the very last epoch.

You do not need to download checkpoints yourself. This library contains the methods and functions needed to make the checkpoints work within our framework; it's all included for you. All model checkpoints are stored on Hugging Face, but they will not work with the Transformers library out-of-the-box.

| Instruction-style | Instruction Modalities | Prompt-conditioning | Vision Encoder | Shuffled Objects? | WandB Run ID | Experiment ID |
|---|---|---|---|---|---|---|
| Original | Text + Visual | Cross-Attention | Object-Centric | False | 8lkml12g | 01_their_vima |
| Original | Text + Visual | Cross-Attention | Object-Centric | True | ftwoyjb1 | 01_their_vima_shuffle_obj |
| Original | Text + Visual | Cross-Attention | Image-Patches | N/A | ln4nrqhg | 01_their_vima_patches |
| Original | Text + Visual | Concatenate | Object-Centric | False | bhuja4vo | 08_their_gpt |
| Original | Text + Visual | Concatenate | Object-Centric | True | wn9jc5l8 | 08_their_gpt_shuffle_obj |
| Original | Text + Visual | Concatenate | Image-Patches | N/A | efxugme9 | 08_their_gpt_patches |
| Paraphrases | Text + Visual | Cross-Attention | Object-Centric | False | 2df3mwfn | 02_their_vima |
| Paraphrases | Text + Visual | Cross-Attention | Object-Centric | True | 0nsnkaer | 02_their_vima_shuffle_obj |
| Paraphrases | Text + Visual | Cross-Attention | Image-Patches | N/A | ah5btw8w | 02_their_vima_patches |
| Paraphrases | Text + Visual | Concatenate | Object-Centric | False | fs5v61mz | 09_their_gpt |
| Paraphrases | Text + Visual | Concatenate | Object-Centric | True | xb3yttg9 | 09_their_gpt_shuffle_obj |
| Paraphrases | Text + Visual | Concatenate | Image-Patches | N/A | zby6xk27 | 09_their_gpt_patches |

How I ran things

Important

Everything that was run, in some shape or form, starts from a module in src/cogelot/entrypoints/. These modules were used to create the datasets, train models, evaluate models, and more. Everything I ran started from that folder.

This is not a comprehensive library made for all use cases and every possible scenario; it's a research project. That said, I tried to make everything as clear as possible for you. In this section, I detail how I did everything so that you can use it as an example of how to get started yourself.

I have tried to make sure that docstrings and comments are relevant and detailed. If you want more information on what a function is doing or why, feel free to make an issue. If you figure out something that I haven't described well enough, feel free to make a PR improving my documentation so that you, I, and future readers can benefit from your insight.

How I managed and installed dependencies

I used PDM to manage this project. Everything you need to know about installing the dependencies for this project can be found in the pyproject.toml.

To quickly install and get up and running, you can run the following:

pdm install
What if you use requirements.txt?

I have exported and included the requirements.txt from PDM. Using it is up to you. I'm not going to be maintaining it, but it's there if you need it.
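If you do go that route, something like the following should work (a rough sketch assuming Python 3.11 and a plain venv; untested, since I stick with PDM):

python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt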

How I install dependencies on every machine

I literally just run the following on the machines I use. I don't use Windows though so I can't help you there.

mise use [email protected] pdm@latest
pdm install
How to make sure it works on your machine

The quickest way to make sure you're all set up is to run either of the following:

  • If you know you've got a venv activated or something

    python -m cogelot
  • If you're using PDM instead of activating the venv

    pdm run python -m cogelot

How I checked that everything worked before I ran things

Things happen and things break. I needed a sense check to make sure everything worked. I developed everything using tests to verify that each piece works in isolation and together. This was the first thing I did when using a new machine or node or whatever.

You can find all the tests in the tests/ folder. The various tests are a good way of seeing how the different pieces were implemented and are used. While coverage is not 100%, I used the tests with breakpoints to verify things work as expected.

How to make sure all tests can be loaded without errors
pdm run pytest --deselect tests/test_online_evaluation.py --collect-only

This is also useful for just making sure things installed correctly and that all tests can be found.

How to run all the tests
pdm run pytest --deselect tests/test_online_evaluation.py

I've also got CI doing this so you can check the badge at the top of the README to see if everything is working as expected.

Check out pytest-xdist if you want to know more about running tests in parallel, or just throw -n auto on the end of the above commands. It makes it go faster.1
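For instance, the full test run with parallel workers would look something like this:

pdm run pytest --deselect tests/test_online_evaluation.py -n auto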

Tip

Before spawning an instance and starting to train with GPUs, you can run the above command on your machine to make sure everything works on CPU. As Lightning handles all of the GPU communication, if it works on CPU, there's a 99% chance it'll work on GPU.2

How I trained models

Important

All model training was orchestrated with Hydra; the configs can be found in the configs/ folder.

I went all out on the Hydra stuff and everything is pretty compositional. The configs/experiments sub-dir contains the experiments that were run (and directly connect to the checkpoints table). As a result, if you want to just train a model, you can run:

pdm run python src/cogelot/entrypoints/train.py --experiment=01_their_vima

Tip

You can find the experiments in the folder, or check the Experiment ID column in the above table for what each one means since the names aren't the clearest.
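For example, training the concatenated prompt-conditioning variant on the original instructions (08_their_gpt in the table) should just be a matter of swapping the experiment name:

pdm run python src/cogelot/entrypoints/train.py --experiment=08_their_gpt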

Training on different hardware

The configs/hardware folder contains the hardware configurations that were used to run the experiments. These are used to set the number of GPUs, the number of CPUs, and the memory available to the model. These were preset for the cluster I was using, but you can adjust them to your needs.

How to train models on OCI

This was a while ago now, but I had a setup script which you can find at scripts/setup-oci-a100.sh. This was used to set up the environment on the OCI instance I was using. It's not perfect, but it's a good starting point.

How I trained models on K8s

Running on K8s was a bit more involved, but it's all here. That said, it will be different for your setup.

My pod spec was:

apiVersion: v1
kind: Pod
metadata:
  name: &name cogelot-1
  namespace: ???
spec:
  restartPolicy: Never
  containers:
    - name: 1st
      image: amitkparekh/python-pdm-cuda:latest
      envFrom:
        - secretRef:
            name: amit-cogelot
      imagePullPolicy: Always
      command: ["/bin/bash", "-c"]
      args:
        - gh repo clone amitkparekh/cogelot cogelot &&
          cd cogelot &&
          bash ./scripts/setup-eidf.sh 2>&1 | tee setup-eidf.log &&
          sleep infinity
          #bash ./scripts/run-sweep-4.sh
      resources:
        requests:
          cpu: &num-cpu 10
          memory: &num-memory "150Gi"
          nvidia.com/gpu: &num-gpu 4
        limits:
          cpu: *num-cpu
          memory: *num-memory
          nvidia.com/gpu: *num-gpu
      volumeMounts:
        - mountPath: /mnt/ceph_rbd
          name: volume
          # this is necessary for training in distributed mode - used for different processes to communicate
        - mountPath: /dev/shm
          name: dshm1
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
  volumes:
    - name: volume
      persistentVolumeClaim:
        claimName: *name
    - name: dshm1
      emptyDir:
        medium: Memory

The Dockerfile is public and the secrets contained the following:

WANDB_API_KEY=???
HUGGING_FACE_HUB_TOKEN=???
GH_TOKEN=???

HF_HUB_VERBOSITY=info

WANDB_CONFIG_DIR=/mnt/ceph_rbd/wandb
WANDB_CACHE_DIR=/mnt/ceph_rbd/wandb
HF_HOME=/mnt/ceph_rbd/huggingface
TORCH_HOME=/mnt/ceph_rbd/torch

How I ran checkpoints in the environment

Again, this uses Hydra so, like training, the entrypoint is src/cogelot/entrypoints/evaluate.py and the config for it is configs/evaluate.yaml.

To run the evaluation in the environment, I used the following command:

pdm run python src/cogelot/entrypoints/evaluate.py trainer.devices=1 model.model.wandb_run_id=8lkml12g
How to choose your checkpoint

The model.model.wandb_run_id parameter is used to select which checkpoint to evaluate. The Run ID is the one from the table above.

By default, we use the epoch from the last checkpoint, but if you want to change the epoch, just add model.model.epoch=<epoch_number> to the command.
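For example, to evaluate an earlier checkpoint of the first run (the epoch number below is just a placeholder; use whichever epoch you want from that run):

pdm run python src/cogelot/entrypoints/evaluate.py trainer.devices=1 model.model.wandb_run_id=8lkml12g model.model.epoch=2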

How to run multiple runs in parallel

The trainer.devices option sets how many CPU processes are created for evaluation, since evaluation does not need the GPU. Change the number in the command to however many processes you want (see the example after the list below).

Important things to note:

  1. The more processes/devices you use, the more memory you need, since multiple instantiations of the model are loaded into memory.
  2. I did not do anything fancy with batching across instances. Since we use CPU for evaluation, I didn't need to.
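For example, evaluating the first checkpoint with 8 parallel processes:

pdm run python src/cogelot/entrypoints/evaluate.py trainer.devices=8 model.model.wandb_run_id=8lkml12g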
How to perturb the instructions

You can find all of these in configs/evaluation_instance_transform/. Each file can be invoked by appending evaluation_instance_transform=<file_name> to the command (see the example after the table).

| Evaluation Instance Transform | Description |
|---|---|
| noop | Interleave modalities in the prompt (the default) |
| gobbledygook_tokens | Apply Gobbledygook Tokens to the prompt |
| gobbledygook_words | Apply Gobbledygook Words to the prompt |
| reworded | Use paraphrased instructions with interleaved modalities |
| textual | Convert visual referents to text |
| textual_gobbledygook_words | Convert visual referents to text and apply Gobbledygook Words |
| textual_gobbledygook_tokens | Convert visual referents to text and apply Gobbledygook Tokens |
| textual_no_noun | Convert visual referents to text, but remove the nouns |
| textual_no_texture | Convert visual referents to text, but remove the descriptions of nouns |
| textual_generic_noun | Convert visual referents to text, but replace each noun with a generic form (e.g. "block" becomes "thing") |
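For example, evaluating the first checkpoint with the Gobbledygook Words perturbation:

pdm run python src/cogelot/entrypoints/evaluate.py trainer.devices=1 model.model.wandb_run_id=8lkml12g evaluation_instance_transform=gobbledygook_words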
How to disable modalities in the prompt

You can find all of these in configs/evaluation_prompt_modality/. Each file can be invoked by appending evaluation_prompt_modality=<file_name> to the command (see the example after the table).

| Evaluation Prompt Modality | Description |
|---|---|
| disable_none | Do nothing |
| disable_text | Disable the text modality |
| disable_visual | Disable the visual modality |
| disable_both | Disable both modalities, basically masking every token |
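For example, masking out the text modality in the prompt:

pdm run python src/cogelot/entrypoints/evaluate.py trainer.devices=1 model.model.wandb_run_id=8lkml12g evaluation_prompt_modality=disable_text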
How to permute object token order for observations

Append model.should_shuffle_obj_per_observations=true to the command. This will shuffle the object tokens in the observation.

How to run on different difficulties

Append model.difficulty=<difficulty> to the command. The difficulties are:

  • easy
  • medium (unused)
  • hard (unused)
  • extreme
  • distracting
  • extremely_distracting
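These are all independent Hydra overrides, so they should compose; for example, evaluating on the extreme difficulty with shuffled object tokens:

pdm run python src/cogelot/entrypoints/evaluate.py trainer.devices=1 model.model.wandb_run_id=8lkml12g model.difficulty=extreme model.should_shuffle_obj_per_observations=true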
How I ran the checkpoint from VIMA in the environment

I downloaded the checkpoint from VIMA's repo, renamed it to them.ckpt and put it in storage/data/models. If you want to change the path used, you can change the path in configs/model/from_their_policy.yaml.

I used the following command to run the checkpoint from VIMA in the environment:

SLURM_JOB_NAME='bash' pdm run python src/cogelot/entrypoints/evaluate_theirs.py trainer.devices=20

You can use all of the other perturbations mentioned above.
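For example, running their checkpoint with the paraphrased instructions (assuming the same overrides apply to evaluate_theirs.py, as suggested above):

SLURM_JOB_NAME='bash' pdm run python src/cogelot/entrypoints/evaluate_theirs.py trainer.devices=20 evaluation_instance_transform=reworded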

How to run checkpoints with a live display

If you want to see what's going on live, you can append [email protected]=display onto the evaluate command.

Importantly, only use one process because I don't know what'll happen if you don't.

Also, this wasn't run on SLURM, just on my Mac. I can't speak for every machine so your mileage may vary.

How to evaluate models on SLURM

It is very unlikely that I ran things in a tmux session and just stared at it. I don't like copy-pasting hundreds of commands.

As experiments were often run on a compute cluster, I ran commands with SLURM. You can find the batch files in ./scripts/slurm/. These were made for my system, so some adjustments will likely be needed, but I'm hoping it's obvious and not too complicated!

How I prepared the dataset

So that things can be run quickly, the dataset was loaded and parsed with Pydantic, and then converted into a HF dataset. There are unit tests showing how this was done in tests/test_dataset_creation.py.

The dataset was processed in two steps. The first step was to parse the raw data and pickle it into individual files. This was done because it was the most time-consuming part of the process. The second step was to load the pickled files and convert them into a HF dataset.

To make loading data efficient when modelling, all the instances were tokenised in advance. This tokenised version is also available on HF under a different config name.

Note

For each of the following commands, you can append --help to get more information on the command, what it does, and the various arguments that control it. Alternatively, you can change things using the Pydantic settings in src/cogelot/common/settings.py.

For example, each command has a way of distributing the load to multiple workers, and even splitting them across multiple SLURM jobs to make it go so much faster.
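As an illustration, this prints the options for the parsing command used in Step 2 below:

pdm run python -m cogelot parse-original-dataset --help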

Step 1. Download the raw data from VIMA

The raw data was downloaded from VIMA. Each instance is a folder of multiple files. Once extracted, the folder structure looked like this:

<project_root>
└─ storage/
    └─ data/
        └─ raw/
            └─ vima_v6/
                └─ <task_name>/
                    └─ <instance_id>/

I used symlinks to make it easier to manage the data, but this is what the folder structure was. If you want to use a different directory, you can change it using the Pydantic settings in src/cogelot/common/settings.py.

Step 2. Parse the original data

The raw data was parsed and pickled into individual files, and then it was converted into a HF dataset. This is for speed.

pdm run python -m cogelot parse-original-dataset --replace-if-exists
pdm run python -m cogelot create-raw-dataset-per-task

I have separate SBATCH files for these steps:

  • scripts/slurm/parse-original-dataset.sh
  • scripts/slurm/create-raw-dataset.sh
Step 3. Tokenize and preprocess for faster training

Again, we preprocess and dump each instance as a pickle (because it's faster) before turning everything into the HF dataset.

pdm run python -m cogelot preprocess-instances
pdm run python -m cogelot create-preprocessed-dataset-per-task

I have separate SBATCH files for these steps:

  • scripts/slurm/preprocess-instances.sh
  • scripts/slurm/create-preprocessed-dataset.sh
Step 4. Create a dataset variant of paraphrased instructions

We use the instances from the previous steps to make the new variations, and use environment variables to create the preprocessed versions of the dataset.

pdm run python -m cogelot create-reworded-dataset-per-task original reworded

DATASET_VARIANT=reworded pdm run python -m cogelot preprocess-instances
DATASET_VARIANT=reworded pdm run python -m cogelot create-preprocessed-dataset-per-task

Again, I have an SBATCH file for this: scripts/slurm/create-reworded-dataset.sh, or more conveniently, a bash script to submit the SBATCH jobs: scripts/submit-reworded-dataset-creation-jobs.sh.

Step 5. Upload all the datasets

This one's just for me but refer to scripts/submit-dataset-upload-jobs.sh for uploading all the datasets to HF as fast as possible without hitting the rate limit.

License

VIMA, VIMA-Bench, and all artefacts from VIMA are licensed under the MIT License. Everything within this repository continues to be licensed under the MIT License.

Citation

@misc{parekh2024investigatingroleinstructionvariety,
  title = {Investigating the {{Role}} of {{Instruction Variety}} and {{Task Difficulty}} in {{Robotic Manipulation Tasks}}},
  author={Amit Parekh and Nikolas Vitsakis and Alessandro Suglia and Ioannis Konstas},
  year={2024},
  eprint={2407.03967},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.03967},
}

Footnotes

  1. I don't know what happens if you replace auto with a number that has more processes than your machine. Maybe don't do that.

  2. This number is made up, but I'm pretty sure about it.
