Understanding Workflow
VOXRAD offers two ways to transcribe audio into a report:
1. Use a transcription model to first transcribe the audio, then a text model to format and restructure the transcript using the instruction template.
2. Use a multimodal model that takes the audio and the instruction template directly and produces the output.
Important: The first method requires two API keys in the settings, one for the "Transcription Model" and one for the "Text Model". The second method requires only a single "Multimodal Model" key.
Supported LLMs
The application supports three types of LLMs.
Transcription Model
Transcribes audio to text; models like whisper. Most API services impose an upper limit of 25 MB per uploaded audio file.
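For illustration, here is a minimal sketch of this step using the OpenAI Python SDK (an assumption for the example; VOXRAD may use a different client internally). The file name and helper function are hypothetical, and the size check mirrors the 25 MB upload limit mentioned above.

```python
import os
from pathlib import Path

from openai import OpenAI  # pip install openai

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # typical 25 MB API upload limit


def transcribe(audio_path: str) -> str:
    """Transcribe a recorded audio file to plain text with whisper."""
    if Path(audio_path).stat().st_size > MAX_UPLOAD_BYTES:
        raise ValueError("Audio exceeds the 25 MB limit; split or compress it first.")
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )
    return result.text


if __name__ == "__main__":
    print(transcribe("dictation.wav"))  # hypothetical recording
```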
Text Model
Uses the transcript and the instruction template to generate the response; models like gpt-4 and Llama 3.
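Continuing the first method, the sketch below shows the formatting step, again assuming the OpenAI SDK as the client; format_report and the template file are illustrative names, not part of VOXRAD's own API.

```python
from openai import OpenAI  # pip install openai


def format_report(transcript: str, instruction_template: str) -> str:
    """Restructure a raw transcript into a report using a text model."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            # The instruction template tells the model how to structure the report.
            {"role": "system", "content": instruction_template},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content


# Example: feed the whisper output from the previous step into the text model.
# report = format_report(transcribe("dictation.wav"), open("template.txt").read())
```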
Multimodal Model
Takes the user's recorded audio and the instruction template directly to generate the output; models like gemini-1.5-flash. Only the gemini-1.5-pro and gemini-1.5-flash multimodal models are supported, and support is experimental. GPT-4o will be supported once its audio API becomes available. As these are remotely hosted LLMs, they should not be used for any sensitive data.
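For the second method, here is a minimal sketch using Google's google-generativeai SDK (assumed for illustration). The single key is the "Multimodal Model" key from the settings, and upload_file sends the audio through the Gemini File API.

```python
import google.generativeai as genai  # pip install google-generativeai

genai.configure(api_key="YOUR_GEMINI_API_KEY")  # the single "Multimodal Model" key


def report_from_audio(audio_path: str, instruction_template: str) -> str:
    """Send recorded audio plus the instruction template to a multimodal model."""
    # Remotely hosted model: do not send sensitive audio, per the note above.
    audio_file = genai.upload_file(audio_path, mime_type="audio/wav")
    model = genai.GenerativeModel("gemini-1.5-flash")
    response = model.generate_content([instruction_template, audio_file])
    return response.text


# report = report_from_audio("dictation.wav", open("template.txt").read())
```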