The result of this stage of our processing is a defined area around each detected mouth. We extract 121x121 pixels centered on the mouth’s central pixel for 21 frames from the compressed video (for Amatitlania nigrofasciata and Hemichromis bimaculatus) or 241x241 pixels for 41 frames from the original hi-res video (for the slower-eating Sparus aurata). The spatial dimensions were chosen to cover the entire head, with sufficient margins for food that may be floating around the fish. The temporal dimensions were determined empirically to be long enough to span a feeding event. Figure 5 depicts frames from an example pose-normalized video volume of a feeding fish.
Extracted spatio-temporal volume of a feeding fish in the canonical view (horizontal, right-facing). The prey is marked by a red circle and enters the mouth at 60 ms. The mouth is closed at 120 ms.
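The volume extraction described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `extract_volume`, the `(T, H, W)` frame layout, the temporal centering on the detection frame, and the edge-replication padding near image and clip borders are all assumptions made for illustration. Pose normalization (rotating and flipping to the right-facing view) is not included.

```python
import numpy as np


def extract_volume(frames, center_xy, center_t, size=121, n_frames=21):
    """Crop a size x size x n_frames volume around a detected mouth.

    frames    : array of shape (T, H, W) or (T, H, W, C)
    center_xy : (x, y) pixel coordinates of the mouth center
    center_t  : index of the frame the window is centered on (assumption)
    """
    half_s = size // 2
    half_t = n_frames // 2
    cx, cy = center_xy

    # Assumption: replicate edge values so that windows near the image
    # border or the clip boundaries still have the full size.
    pad = [(half_t, half_t), (half_s, half_s), (half_s, half_s)]
    pad += [(0, 0)] * (frames.ndim - 3)
    padded = np.pad(frames, pad, mode="edge")

    # After padding, original index i sits at i + half, so a window
    # centered on i starts at padded index i.
    return padded[center_t:center_t + n_frames,
                  cy:cy + size,
                  cx:cx + size]


# Synthetic clip: 30 grayscale frames of 200 x 200 pixels.
clip = np.arange(30 * 200 * 200).reshape(30, 200, 200)

# Settings quoted for the compressed video: 121 x 121 pixels, 21 frames.
small = extract_volume(clip, center_xy=(100, 90), center_t=15)

# Settings quoted for the hi-res video: 241 x 241 pixels, 41 frames.
large = extract_volume(clip, center_xy=(0, 0), center_t=0,
                       size=241, n_frames=41)
```

Padding the whole clip copies it in memory, which keeps the sketch short; a real pipeline would more likely pad only the cropped window.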