πŸ”„ Understanding Workflow

VOXRAD offers two ways to transcribe audio into a report.

  1. Use a transcription model to first transcribe the audio, then format and restructure the transcript using an instruction template.

  2. Use a multimodal model that takes the audio and the instruction template directly as input and produces the output.

Important: If using the first method, you must provide two keys in the settings: one for the "Transcription Model" and one for the "Text Model". The second method requires only a single "Multimodal Model" key.
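The two workflows above can be sketched as follows. This is a minimal illustration, not VOXRAD's actual implementation: the model-calling functions are hypothetical stand-ins for real API clients, and their names and signatures are assumptions.

```python
# Sketch of the two VOXRAD workflows. Each model call below is a
# hypothetical stub standing in for a real API client.

def transcribe_audio(audio: bytes) -> str:
    """Stand-in for a transcription model call (e.g. whisper)."""
    return "transcript of the dictated report"

def format_with_text_model(transcript: str, template: str) -> str:
    """Stand-in for a text model call (e.g. gpt-4 or Llama 3)."""
    return f"{template}: {transcript}"

def multimodal_report(audio: bytes, template: str) -> str:
    """Stand-in for a multimodal model call (e.g. gemini-1.5-flash)."""
    return f"{template}: report generated directly from audio"

def method_one(audio: bytes, template: str) -> str:
    # First method: two steps, two API keys
    # (Transcription Model + Text Model).
    transcript = transcribe_audio(audio)
    return format_with_text_model(transcript, template)

def method_two(audio: bytes, template: str) -> str:
    # Second method: one step, one API key (Multimodal Model).
    return multimodal_report(audio, template)
```

The practical difference is the number of round trips and keys: method one makes two model calls and needs two keys, while method two makes a single call with a single key.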

Supported LLMs

The application supports three types of LLMs.

| Model | Capabilities |
| --- | --- |
| Transcription Model | Transcribes audio to text. Models like whisper. Most API services have an upper limit of 25 MB per audio file. |
| Text Model | Uses the transcript and the instruction template to generate the response. Models like gpt-4 and Llama 3. |
| Multimodal Model | Takes the user's recorded audio and the instruction template directly to generate output. Models like gemini-1.5-flash. |
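Since most transcription APIs reject uploads above 25 MB, a pre-flight size check before sending a recording can save a failed request. This is a minimal sketch, assuming the audio is already loaded as bytes; the function name is illustrative, not part of VOXRAD.

```python
# Hypothetical pre-flight check against the 25 MB upload limit
# that most transcription API services enforce.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # 25 MB

def within_upload_limit(audio: bytes) -> bool:
    """Return True if the recording fits under the 25 MB API cap."""
    return len(audio) <= MAX_UPLOAD_BYTES
```

A recording that exceeds the limit would need to be compressed or split into chunks before being submitted.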
