The things I see as necessary are characters, actions & events, props & costumes, and settings, pretty much in that order of importance. Each action should indicate who performed it and who it was done to. (Costumes only need to be included if they're different from the character's day-to-day clothing, like Doc's jersey or Sandy's harem girl... whatever that's called.)
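To make that concrete, here's a rough sketch of what one strip's record might look like. This is only an illustration, not a proposed format: the class names, field names, and the `strips_where` helper are all made up for the example, and the sample action is invented rather than taken from a real strip.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Action:
    """One action or event, with who did it and who it was done to."""
    description: str
    performed_by: List[str]
    performed_on: List[str] = field(default_factory=list)


@dataclass
class StripRecord:
    """Hypothetical per-strip record; fields follow the order of importance above."""
    characters: List[str]
    actions: List[Action] = field(default_factory=list)
    props: List[str] = field(default_factory=list)
    # Only filled in when a character is NOT in their day-to-day clothing.
    costumes: Dict[str, str] = field(default_factory=dict)
    setting: str = ""


def strips_where(records: List[StripRecord], actor: str, target: str) -> List[StripRecord]:
    """Return the strips in which `actor` does something to `target`."""
    return [
        r for r in records
        if any(actor in a.performed_by and target in a.performed_on for a in r.actions)
    ]


# Invented sample data, purely for illustration.
example = StripRecord(
    characters=["Doc", "Sandy"],
    actions=[Action("hands over a wrench", performed_by=["Doc"], performed_on=["Sandy"])],
    props=["wrench"],
    costumes={"Doc": "jersey"},
    setting="workshop",
)
```

Keeping "performed by" and "performed on" as separate lists is what lets a search tell "Doc did X to Sandy" apart from "Sandy did X to Doc".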
Most of my searches are for events, followed closely by dialogue. But encoding events is going to take more brain cycles than just transcribing dialogue; it'll take some thought to create descriptions of the action that's normally shown or implied by the art.
Would reading the dialogue and narrating the action into a speech-to-text program save some time? The result wouldn't be ready to publish without a lot of editing, but it might speed up getting the initial words on disk.