Learning dexterity


Our system, called Dactyl, is trained entirely in simulation and transfers its knowledge to reality, adapting to real-world physics using techniques we've been working on for the past year. Dactyl learns from scratch using the same general-purpose reinforcement learning algorithm and code as OpenAI Five. Our results show that it's possible to train agents in simulation and have them solve real-world tasks, without physically accurate modeling of the world.

Examples of dexterous manipulation behaviors autonomously learned by Dactyl.

Dactyl is a system for manipulating objects using a Shadow Dexterous Hand. We place an object such as a block or a prism in the palm of the hand and ask Dactyl to reposition it into a different orientation; for example, rotating the block to put a new face on top. The network observes only the coordinates of the fingertips and the images from three regular RGB cameras.

Although the first humanoid hands were developed decades ago, using them to manipulate objects effectively has been a long-standing challenge in robotic control. Unlike other problems such as locomotion, progress on dexterous manipulation using traditional robotics approaches has been slow, and current techniques remain limited in their ability to manipulate objects in the real world.

Reorienting an object in the hand requires solving several hard problems simultaneously.

Dactyl learns to solve the object reorientation task entirely in simulation without any human input. After this training phase, the learned policy works on the real robot without any fine-tuning.

Learning methods for robotic manipulation face a dilemma. Simulated robots can easily provide enough data to train complex policies, but most manipulation problems can't be modeled accurately enough for those policies to transfer to real robots. Even modeling what happens when two objects touch—the most basic problem in manipulation—is an active area of research with no widely accepted solution. Training directly on physical robots allows the policy to learn from real-world physics, but today's algorithms would require years of experience to solve a problem like object reorientation.

Our approach, _domain randomization_, learns in a simulation which is designed to provide a variety of experiences rather than maximizing realism. This gives us the best of both approaches: by learning in simulation, we can gather more experience quickly by scaling up, and by de-emphasizing realism, we can tackle problems that simulators can only model approximately.

It's been shown (by OpenAI and others) that domain randomization can work on increasingly complex problems—domain randomizations were even used to train OpenAI Five. Here, we wanted to see if scaling up domain randomization could solve a task well beyond the reach of current methods in robotics.

We built a simulated version of our robotics setup using the MuJoCo physics engine. This simulation is only a coarse approximation of the real robot.

The simulation can be made more realistic by calibrating its parameters to match robot behavior, but many of these effects simply cannot be modeled accurately in current simulators.

Instead, we train the policy on a distribution of simulated environments where the physical and visual attributes are chosen randomly. Randomized values are a natural way to represent the uncertainties that we have about the physical system and also prevent overfitting to a single simulated environment. If a policy can accomplish the task across all of the simulated environments, it is more likely to accomplish it in the real world as well.
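As a concrete illustration of the idea (not Dactyl's actual training code), domain randomization can be as simple as resampling physical parameters at the start of each simulated episode. The parameter names and ranges below are hypothetical:

```python
import random

# Hypothetical physical parameters to randomize each episode.
# Names and ranges are illustrative, not Dactyl's actual values.
RANDOMIZATION_RANGES = {
    "object_mass_scale":   (0.5, 1.5),   # multiplier on nominal mass
    "friction_scale":      (0.7, 1.3),   # multiplier on surface friction
    "actuator_gain_scale": (0.75, 1.5),  # multiplier on motor strength
    "observation_noise":   (0.0, 0.02),  # std dev of added sensor noise
}

def sample_randomization(rng=random):
    """Draw one set of physics parameters for a new simulated episode."""
    return {name: rng.uniform(lo, hi)
            for name, (lo, hi) in RANDOMIZATION_RANGES.items()}

# Each training episode sees a slightly different "world"; a policy that
# succeeds across all of them is less likely to have overfit to any one
# simulator configuration.
episode_params = sample_randomization()
```

In a real setup these sampled values would be written into the simulator's model before each episode begins.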

## Learning to control

By building simulations that support transfer, we have reduced the problem of controlling a robot in the real world to accomplishing a task in simulation, which is a problem well-suited for reinforcement learning. While the task of manipulating an object in a simulated hand is already somewhat difficult, learning to do so across all combinations of randomized physical parameters is substantially more difficult.

To generalize across environments, it is helpful for the policy to be able to take different actions in environments with different dynamics. Because most dynamics parameters cannot be inferred from a single observation, we used an LSTM—a type of neural network with memory—to make it possible for the network to learn about the dynamics of the environment. The LSTM achieved about twice as many rotations in simulation as a policy without memory.
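The intuition behind the memory can be shown with a toy example (this is not Dactyl's network): a single noisy observation reveals little about a hidden dynamics parameter, but a state carried across timesteps can integrate many observations. An LSTM plays an analogous role, except that it learns what to remember.

```python
import random

def run_episode(true_friction, steps=200, seed=0):
    """Toy sketch: a single hidden 'memory' value integrates noisy
    observations of an unobservable dynamics parameter over an episode."""
    rng = random.Random(seed)
    memory = 0.0
    for t in range(1, steps + 1):
        obs = true_friction + rng.gauss(0, 0.1)  # one noisy observation
        memory += (obs - memory) / t             # running-mean update
    return memory

# After many steps, the memory identifies the hidden parameter far more
# accurately than any single observation (whose noise has std dev 0.1) could.
estimate = run_episode(true_friction=0.8)
```

A memoryless policy sees only the current noisy observation and cannot perform this kind of inference, which is consistent with the roughly 2x gap reported above.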

Dactyl learns using Rapid, the massively scaled implementation of Proximal Policy Optimization developed to allow OpenAI Five to solve Dota 2. We use a different model architecture, environment, and hyperparameters than OpenAI Five does, but we use the same algorithms and training code. Rapid used 6144 CPU cores and 8 GPUs to train our policy, collecting about one hundred years of experience in 50 hours.
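To put those numbers in perspective, a quick back-of-the-envelope calculation from the figures quoted above gives the effective speedup of simulated experience over real time:

```python
# Figures from the text: ~100 years of experience gathered in 50 wall-clock hours.
HOURS_PER_YEAR = 365 * 24                  # 8760, ignoring leap years
experience_hours = 100 * HOURS_PER_YEAR    # 876,000 simulated hours
wall_clock_hours = 50

speedup = experience_hours / wall_clock_hours
print(f"Effective speedup: ~{speedup:,.0f}x real time")  # ~17,520x
```

This is the core argument for training in simulation: a physical robot could never gather a century of experience, but thousands of parallel simulated hands can.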

For development and testing, we validated our control policy against objects with embedded motion tracking sensors to isolate the performance of our control and vision networks.

Dactyl was designed to be able to manipulate arbitrary objects, not just those that have been specially modified to support tracking. Therefore, Dactyl uses regular RGB camera images to estimate the position and orientation of the object.

We train a pose estimator using a convolutional neural network. The neural network takes the video streams from three cameras positioned around the robot hand and outputs the estimated position and orientation of the object. We use multiple cameras to resolve ambiguities and occlusion. We again use domain randomization to train this network only in simulation using the Unity game development platform, which can model a wider variety of visual phenomena than MuJoCo.
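Dactyl's vision network fuses the three camera images inside the CNN itself, but the underlying intuition for why multiple views help can be sketched with made-up numbers: a camera whose view of the object is occluded contributes little, and the remaining views resolve the ambiguity.

```python
# Hypothetical sketch (not Dactyl's method): each camera yields a position
# estimate with a confidence; occluded views get low confidence, and a
# confidence-weighted combination resolves what one view leaves ambiguous.

def fuse_estimates(estimates):
    """estimates: list of (position_xyz, confidence) pairs, one per camera."""
    total = sum(conf for _, conf in estimates)
    if total == 0:
        raise ValueError("object not visible from any camera")
    return tuple(
        sum(pos[i] * conf for pos, conf in estimates) / total
        for i in range(3)
    )

cameras = [
    ((0.10, 0.02, 0.30), 0.90),  # clear view
    ((0.11, 0.03, 0.29), 0.80),  # clear view
    ((0.50, 0.40, 0.10), 0.05),  # mostly occluded: low confidence
]
fused = fuse_estimates(cameras)  # dominated by the two unoccluded views
```

In the learned system there is no explicit confidence value; the network implicitly learns which views to trust from its randomized training data.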

By combining these two independent networks, the control network that reorients the object given its pose and the vision network that maps images from cameras to the object’s pose, Dactyl can manipulate an object by seeing it.

Example training images used for learning to estimate the pose of the block.

When deploying our system, we noticed that Dactyl uses a rich set of in-hand dexterous manipulation strategies to solve the task. These strategies are commonly used by humans as well. However, we do not teach them to our system explicitly; all behaviors are discovered autonomously.

Examples of dexterous manipulation behaviors autonomously learned by Dactyl.

Dactyl grasp types according to the GRASP taxonomy. Top left to bottom right: Tip Pinch, Palmar Pinch, Tripod, Quadpod, Power grasp, and 5-Finger Precision grasp.

We observed that for precision grasps, such as the Tip Pinch grasp, Dactyl uses the thumb and little finger. Humans tend to use the thumb and either the index or middle finger instead. However, the robot hand's little finger is more flexible due to an extra degree of freedom, which may explain why Dactyl prefers it. This means that Dactyl can rediscover grasps found in humans, but adapt them to better fit the limitations and abilities of its own body.

## Transfer performance

We tested how many rotations Dactyl could achieve before it dropped the object, timed out, or reached 50 successes. Our policies trained purely in simulation were able to successfully manipulate objects in the real world.

Dactyl lab setup with Shadow Dexterous Hand, PhaseSpace motion tracking cameras, and Basler RGB cameras.

For the task of block manipulation, policies trained with randomization could achieve many more rotations than those trained without randomization, as can be seen in the results below. Also, using the control network with pose estimated from vision performs nearly as well as reading the pose directly from motion tracking sensors.

| Randomizations | Object tracking | Max number of successes | Median number of successes |
| --- | --- | --- | --- |
| All randomizations | Vision | 46 | 11.5 |
| All randomizations | Motion tracking | 50 | 13 |
| No randomizations | Motion tracking | 6 | 0 |

## Learning progress

The vast majority of training time is spent making the policy robust to different physical dynamics. Learning to rotate an object in simulation without randomizations requires about 3 years of simulated experience, while achieving similar performance in a fully randomized simulation requires about 100 years of experience.

Learning progress with and without randomizations over years of simulated experience.

## What surprised us

## What didn’t pan out

We also found, to our surprise, that a number of commonly employed techniques did not improve our results.

This project completes a full cycle of AI development that OpenAI has been pursuing for the past two years: we've developed a new learning algorithm, scaled it massively to solve hard simulated tasks, and then applied the resulting system to the real world. Repeating this cycle at increasing scale is the primary route we are pursuing to increase the capabilities of today's AI systems towards safe artificial general intelligence. If you'd like to be part of what comes next, we're hiring!

Josh Tobin, Bob McGrew, Wojciech Zaremba, Maciek Chociej, Szymon Sidor, Glenn Powell, Jakub Pachocki, Alex Ray, Marcin Andrychowicz, Bowen Baker, Arthur Petron, Matthias Plappert, Rafał Józefowicz, Jonas Schneider, Peter Welinder, Lilian Weng

Larissa Schiavo, Jack Clark, Greg Brockman, Ilya Sutskever, Diane Yoon, Ben Barry

Thanks to the following for feedback on drafts of this post: Pieter Abbeel, Tamim Asfour, Marek Cygan, Ken Goldberg, Anna Goldie, Edward Mehr, Azalia Mirhoseini, Lerrel Pinto, Aditya Ramesh & Ian Rust.

Ben Barry & Eric Haines


Originally published on OpenAI News.