Current state-of-the-art action classification methods extract feature representations from the entire video clip in which the action unfolds, however this representation may include irrelevant scen