Pairs of (observation, action) are collected from expert demonstrations. A policy network is trained to map observations to actions by minimising MSE (continuous actions) or cross-entropy (discrete actions). In BC the model learns offline, without environment interaction during training. In more advanced variants such as DAgger, the agent queries the expert in the loop to correct errors caused by distribution shift.
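The training step described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes a linear policy fitted by least squares in place of a neural network, and the expert gain matrix K, the helper names and the sampled observations are all hypothetical.

```python
import numpy as np

def collect_demonstrations(expert, n_samples, obs_dim, rng):
    """Sample observations and record the expert's action for each one."""
    observations = rng.normal(size=(n_samples, obs_dim))
    actions = np.array([expert(o) for o in observations])
    return observations, actions

def fit_bc_policy(observations, actions):
    """Fit a linear policy (with bias term) by minimising mean squared error."""
    X = np.hstack([observations, np.ones((len(observations), 1))])
    W, *_ = np.linalg.lstsq(X, actions, rcond=None)
    return lambda obs: np.append(obs, 1.0) @ W

rng = np.random.default_rng(0)
K = np.array([[1.0, -0.5], [0.2, 0.3]])  # hypothetical expert: action = K @ observation
expert = lambda obs: K @ obs

# Offline training: the environment is never stepped, only the stored pairs are used.
obs, acts = collect_demonstrations(expert, n_samples=200, obs_dim=2, rng=rng)
policy = fit_bc_policy(obs, acts)
```

With discrete actions the same loop would use a classifier and a cross-entropy loss instead of least squares.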
Difficulty of defining reward functions for complex robotic tasks; need for efficient skill transfer from human demonstrations.
BC trains on expert demonstrations, but the robot encounters unseen states during deployment. Small errors compound (covariate shift), leading to catastrophic trajectories.
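The compounding effect can be illustrated with a simplified calculation. It assumes (purely for illustration) that the policy errs with probability eps at each step while it is still on the expert's state distribution, and that after the first error it stays off-distribution and keeps erring. The function names are hypothetical.

```python
def expected_mistakes_no_recovery(eps, horizon):
    """Expected number of erroneous steps when the first error is never recovered from."""
    # Step t is error-free only if all t steps so far were error-free.
    return sum(1.0 - (1.0 - eps) ** t for t in range(1, horizon + 1))

def expected_mistakes_with_recovery(eps, horizon):
    """Expected number of erroneous steps when each step errs independently."""
    return eps * horizon

no_recovery = expected_mistakes_no_recovery(0.01, 100)
with_recovery = expected_mistakes_with_recovery(0.01, 100)
```

Under these assumptions, and for small eps, the first quantity grows roughly as eps * T^2 / 2 with horizon T, while the second grows only as eps * T.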
BC is only as good as the expert data: low-diversity or erroneous demonstrations directly degrade the policy. Collecting data from multiple experts under varied conditions is costly.
Pure BC has no recovery mechanism: the robot does not know how to return to a safe state after deviating from the expert trajectory. It requires augmentation with DAgger or RL.
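A DAgger-style loop that supplies such recovery data might look like the sketch below. Everything here is a toy assumption: the 1-D dynamics, the tanh expert, the cubic least-squares policy and all function names are hypothetical, and the loop omits the expert/learner mixing schedule used in some DAgger formulations.

```python
import numpy as np

def expert_action(x):
    """Hypothetical expert: push the scalar state back towards zero."""
    return -0.5 * np.tanh(2.0 * x)

def features(x):
    """Cubic polynomial features for a scalar state (or an array of states)."""
    x = np.asarray(x, dtype=float)
    return np.stack([x, x**3, np.ones_like(x)], axis=-1)

def fit_policy(states, actions):
    """Least-squares (MSE) fit of the policy on the aggregated dataset."""
    w, *_ = np.linalg.lstsq(features(states), actions, rcond=None)
    return lambda x: float(features(x) @ w)

def rollout(policy, x0, horizon, rng):
    """Run a policy in toy dynamics x' = clip(x + a + noise); return visited states."""
    states, x = [], x0
    for _ in range(horizon):
        states.append(x)
        x = float(np.clip(x + policy(x) + 0.1 * rng.normal(), -2.0, 2.0))
    return np.array(states)

def dagger(n_iters=5, horizon=20, seed=0):
    rng = np.random.default_rng(seed)
    # Iteration 0: plain BC on states visited by the expert itself.
    states = rollout(expert_action, 1.5, horizon, rng)
    actions = expert_action(states)
    policy = fit_policy(states, actions)
    for _ in range(n_iters):
        visited = rollout(policy, 1.5, horizon, rng)   # states the learner reaches
        states = np.concatenate([states, visited])
        actions = np.concatenate([actions, expert_action(visited)])  # expert relabels them
        policy = fit_policy(states, actions)           # retrain on all data so far
    return policy, states, actions

policy, states, actions = dagger()
```

The key difference from pure BC is that the expert labels states reached by the learner's own rollouts, so the dataset includes examples of how to act after a deviation.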
Training neural network policies on large demonstration datasets requires GPUs.