Post

Embodied AI Paper Reviews

Imitation Learning

BC-Z

Gato

  • Did not evaluate generalization to new tasks.
  • Limited real-world data and evaluation.

Robocat

RT-1

  • It has its own action tokenization. It discretized each action dimension into 256 bins (7 dimensions for th robot in the paper).
  • It has a key insight that using continuous actions reduces the success rate by a lot.
    • Gaussian distribution in continous action set up is limited to a single-modal distribution.
  • Adding history observations matter.
  • Trajectory ${x_j, a_j}^T$, there is only a binary reward of success or not in the end.
  • Token Learner compresses large number of tokens to smaller number of tokens.
    • This makes the inference feasible on robot.
  • 3-10 Hz response latency.

PaLM-E

RT-2

  • 55B.
  • It uses the language token space. Either by outputting string (integers) directly, or swap the 256 least used text tokens.
  • Co-Fine-Tuning: train with both robotics data and the orginal web data.
  • Constrain the output token range when we really need the robot actions.
  • The net gains are from unseen objects / backgrounds / environments.
    • This is a very straightforward area of improvements.
  • This model is big and slow (plus sitting on the cloud), not that practical in actual robots.

RT-X

  • RT-1-X outperforms RT-1 (trained with the specific task data).
  • RT-1 architecture underfits on large dataset.
  • Unseen objects, backgrounds and environments: RT-2 and RT-2-X performs the same.
  • RT-2-X is able to perform tasks in Bridge much better even if it is cross-embodiment data.
  • RT-2-X without Bridge data still outperforms RT-2.

  • Cross Embodiment works!
  • And it helps the specialist model as well!!
  • It means let’s go for the multi-task-multi-embodiment data collection.
  • Datasets need combine both scale and breadth
  • Dataset needs to be sufficiently well-connected to enable generalization.
  • 22 robot embodiments, but it seems the datasets are all robot arms with 7 DOFs.

Offline Reinforcement Learning

Q-Transformer

Conservative Q-Learning

Decision Transformer

Limitation of the original paper:

  • It seems hard to generalize to the returns that it did not see before
  • The entire observation vector is converted to a single embedding
    • The embedding conversion is just a linear transformation, feel lack of the ability to project complicated states to the right place.
    • There is a paper did an ablation and found the discretization is better than using the continuous value.
      • I guess the reason is each bin could be saved into an embedding lookup table, making it more nonlinear.
  • At the inference time, the latest return condition prompt = last return expectation - last reward. This could result in the latest return going to some range that the model does not see before in the training data.

    • So in general, the offline dataset basically need to cover the entire return space to perform well.
  • Another thing I notice in my experiment, it randomly stitches different expert demos, but did not result in a better solution.

Prompting-based

Code-as-Policy

Language to Rewards

Learning to Learn faster

  • Daytime in-context-learning based on instruction.
  • Nighttime, fine-tune and update the weights.
This post is licensed under CC BY 4.0 by the author.

Comments powered by Disqus.