Transcribe Files in Parallel

In this example, we transcribe multiple audio files in parallel, using Python's multiprocessing module to run the transcriptions in separate processes.

First we define the process_file() function, which processes an individual audio file in streaming mode:

  • We create a corresponding text file for the audio file to store the transcript.
  • On each result, we write the words into the text file and update the total processed audio duration.
  • At the end, we write the total processed audio duration into the text file.
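
import argparse
import glob
from multiprocessing import Pool

# Client and transcribe_file_stream come from the Soniox Python library.
# The import paths below are assumed and may differ between library versions.
from soniox.speech_service import Client
from soniox.transcribe_file import transcribe_file_stream


def process_file(audio_fn: str) -> None: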
    with Client() as client:
        # Create output text file.
        output_fn = audio_fn + ".txt"
        with open(output_fn, "w") as output_file:
            # Keep track of total processed audio duration.
            duration_ms = 0
            # Transcribe file.
            for result in transcribe_file_stream(audio_fn, client):
                words = [word.text for word in result.words]
                output_file.write(" ".join(words) + "\n")
                duration_ms = result.final_proc_time_ms
            # Write total processed audio duration.
            output_file.write(f"duration_ms={duration_ms}\n")

In the main() function, we:

  • Get all the specified audio filenames.
  • Create a pool of subprocesses for parallelization.
  • Process the audio files in parallel using pool.imap_unordered().
  • Print the number of files processed so far.

def main() -> None:
    parser = argparse.ArgumentParser(
        description="Transcribe audio files in parallel.")
    parser.add_argument(
        "--glob_pathname", type=str, required=True,
        help="Glob pattern of audio files to transcribe."
    )
    parser.add_argument("--processes", type=int, default=4)
    args = parser.parse_args()
    if args.processes < 1:
        parser.error("--processes must be at least 1")

    # Get audio files to be transcribed.
    audio_files = glob.glob(args.glob_pathname)
    if len(audio_files) == 0:
        raise Exception("No files found.")

    # Transcribe in parallel with specified number of processes.
    with Pool(args.processes) as pool:
        for i, _ in enumerate(pool.imap_unordered(process_file, audio_files)):
            print(f"\rFinished {i+1} files out of {len(audio_files)}", end="")
        print()

if __name__ == "__main__":
    main()
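
Note that if process_file() raises an exception in a worker, multiprocessing re-raises it in the parent when that result is fetched from imap_unordered(), which ends the loop in main(). The following is a minimal sketch, not part of the original example, of one way to isolate failures per file. The names safe_call and run_all are hypothetical, and worker stands in for process_file(), which must be a module-level function so it can be sent to the subprocesses.

```python
from functools import partial
from multiprocessing import Pool
from typing import Callable, List, Optional, Tuple


def safe_call(worker: Callable[[str], None], audio_fn: str) -> Tuple[str, Optional[str]]:
    """Run worker on one file; return (filename, error message or None)."""
    try:
        worker(audio_fn)
        return audio_fn, None
    except Exception as exc:
        # Report the failure to the parent instead of ending the pool loop.
        return audio_fn, f"{type(exc).__name__}: {exc}"


def run_all(
    worker: Callable[[str], None], audio_files: List[str], processes: int
) -> List[Tuple[str, str]]:
    """Process all files in parallel; return (filename, error) pairs for failed files."""
    failures = []
    with Pool(processes) as pool:
        results = pool.imap_unordered(partial(safe_call, worker), audio_files)
        for i, (audio_fn, error) in enumerate(results, start=1):
            print(f"\rFinished {i} files out of {len(audio_files)}", end="")
            if error is not None:
                failures.append((audio_fn, error))
    print()
    return failures
```

With this sketch, main() could call run_all(process_file, audio_files, args.processes) and print the returned failures once all files have been attempted.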

When the program finishes, each audio file should have a corresponding text file containing its transcript.
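
To check the output, the transcript files can be read back. The sketch below relies only on the layout that process_file() writes: one line of space-separated words per result, followed by a final duration_ms=<value> line. The helper name read_transcript is hypothetical, and the code assumes the duration is written as an integer.

```python
from typing import Tuple


def read_transcript(output_fn: str) -> Tuple[str, int]:
    """Return (transcript text, processed audio duration in ms) from one output file."""
    with open(output_fn) as f:
        lines = f.read().splitlines()
    # The last line is expected to be the duration trailer written by process_file().
    if not lines or not lines[-1].startswith("duration_ms="):
        raise ValueError(f"{output_fn}: missing duration_ms line (incomplete transcript?)")
    # Assumption: the duration value is an integer number of milliseconds.
    duration_ms = int(lines[-1].split("=", 1)[1])
    # Join the per-result lines, skipping results that contained no words.
    text = " ".join(line for line in lines[:-1] if line)
    return text, duration_ms
```

A missing duration_ms line is treated as an error here, since process_file() writes it only after the whole file has been transcribed.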

Run!

examples/transcribe_files_parallel.py GitHub

python3 transcribe_files_parallel.py --glob_pathname="MY-AUDIO-DIR/*.mp3"
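
Because each transcript is written to the audio filename with ".txt" appended, a rerun could skip files that already have output. This is a minimal sketch under that naming assumption. pending_files is a hypothetical helper, not part of the example script, and it does not detect transcripts left incomplete by an interrupted run.

```python
import glob
import os
from typing import List


def pending_files(glob_pathname: str) -> List[str]:
    """Return matching audio files that do not yet have a '<audio file>.txt' transcript."""
    audio_files = sorted(glob.glob(glob_pathname))
    return [fn for fn in audio_files if not os.path.exists(fn + ".txt")]
```

For example, pending_files("MY-AUDIO-DIR/*.mp3") would list only the .mp3 files that have no matching .mp3.txt file next to them.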