Understanding Workflow
VOXRAD uses two workflows to turn recorded audio into a report:
1. Transcribe the audio with a transcription model, then format and restructure the transcript with an instruction template using a text model.
2. Feed the audio and the instruction template directly to a multimodal model, which produces the output in a single step.
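The two workflows above can be sketched as follows. This is a minimal illustration only: the functions `transcribe`, `format_report`, `two_step_workflow`, and `multimodal_workflow` are hypothetical placeholders standing in for real model API calls, not part of VOXRAD's actual code.

```python
# Sketch of the two VOXRAD workflows. All functions are hypothetical
# placeholders; in practice each would call a hosted model API.

def transcribe(audio: bytes) -> str:
    # Placeholder for a transcription model such as Whisper:
    # audio in, plain-text transcript out.
    return "transcript of " + audio.decode()

def format_report(transcript: str, template: str) -> str:
    # Placeholder for a text model (e.g. GPT-4, Llama 3) that
    # restructures the transcript according to the instruction template.
    return f"{template}: {transcript}"

def two_step_workflow(audio: bytes, template: str) -> str:
    # Workflow 1: transcription model first, then text model.
    return format_report(transcribe(audio), template)

def multimodal_workflow(audio: bytes, template: str) -> str:
    # Workflow 2: a single multimodal model (e.g. gemini-1.5-flash)
    # takes the audio and the template directly. Placeholder logic
    # here just mirrors the two-step result.
    return f"{template}: transcript of {audio.decode()}"
```

Both workflows accept the same inputs (audio plus an instruction template) and produce the same kind of output, which is why the application can offer them interchangeably.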
Supported LLMs
The application supports three types of LLMs.
Transcription Model
Transcribes audio to text, using models like Whisper. Note that most API services enforce an upper limit of 25 MB per audio file.
Text Model
Uses the transcript and the instruction template to generate the report. Models like GPT-4 and Llama 3.
Multimodal Model
Takes the user's recorded audio and the instruction template directly and generates the output. Models like gemini-1.5-flash.
Only the gemini-1.5-pro and gemini-1.5-flash multimodal models are currently supported, and only experimentally. GPT-4o will be supported as its API becomes available. Because these are remotely hosted LLMs, they should not be used with any sensitive data.