What is the gradient of the knowledge distillation loss?
Let $z_i$ be the logits of the small (distilled) model and $v_i$ the logits of the large model, which at temperature $T$ produce the soft probabilities $q_i = e^{z_i/T}/\sum_j e^{z_j/T}$ and $p_i = e^{v_i/T}/\sum_j e^{v_j/T}$. The gradient of the knowledge distillation loss $E$ (the cross-entropy against the large model's soft targets) with respect to each logit $z_i$ is

$$\frac{\partial E}{\partial z_i} = \frac{1}{T}\,(q_i - p_i).$$

For large values of $T$, and assuming the logits are zero-meaned for each example, this approximates to $\frac{1}{N T^2}(z_i - v_i)$, where $N$ is the number of classes; i.e. the loss is equivalent to matching the logits of the two models, as done in model compression (Hinton, Geoffrey; Vinyals, Oriol; Dean, Jeff (2015). "Distilling the Knowledge in a Neural Network". arXiv:1503.02531).
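As a sanity check, here is a minimal numerical sketch (assuming NumPy and some hypothetical zero-meaned logits for a 5-class problem) comparing the exact gradient $(q_i - p_i)/T$ with the high-temperature approximation $(z_i - v_i)/(N T^2)$:

```python
import numpy as np

def softmax(x, T):
    """Temperature-scaled softmax."""
    e = np.exp(x / T - np.max(x / T))  # shift for numerical stability
    return e / e.sum()

# Hypothetical zero-meaned logits for an N-class problem.
rng = np.random.default_rng(0)
z = rng.normal(size=5); z -= z.mean()  # small (distilled) model logits
v = rng.normal(size=5); v -= v.mean()  # large (teacher) model logits
N = len(z)

T = 1000.0  # temperature large relative to the logit magnitudes
q, p = softmax(z, T), softmax(v, T)

exact = (q - p) / T              # dE/dz_i = (q_i - p_i) / T
approx = (z - v) / (N * T ** 2)  # high-temperature limit

# The two agree to within a small tolerance when T is large.
print(np.allclose(exact, approx))  # True
```

With a smaller $T$ (say $T = 1$), the two expressions diverge, which is the point of the derivation: only in the high-temperature limit does distillation reduce to logit matching.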