Understanding Workflow
VOXRAD offers two ways to transcribe audio into a report:
1. Use a transcription model to first transcribe the audio, then a text model to format and restructure the transcript using the instruction template.
2. Use a multimodal model that takes the audio and the instruction template directly and produces the output.
Important: The first method requires two API keys in the settings, one for the "Transcription Model" and one for the "Text Model". The second method requires only a single "Multimodal Model" key.
Supported LLMs
The application supports three types of LLMs.
Transcription Model
Transcribes audio to text; models like whisper. Most API services impose an upper limit of 25 MB per uploaded audio file.
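For illustration, here is a minimal sketch of this step using the OpenAI Python SDK (an assumption for the example; VOXRAD may use a different client internally). The file name and helper function are hypothetical, and the size check mirrors the 25 MB upload limit mentioned above.

```python
import os
from pathlib import Path

from openai import OpenAI  # pip install openai

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # typical 25 MB API upload limit


def transcribe(audio_path: str) -> str:
    """Transcribe a recorded audio file to plain text with whisper."""
    if Path(audio_path).stat().st_size > MAX_UPLOAD_BYTES:
        raise ValueError("Audio exceeds the 25 MB limit; split or compress it first.")
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )
    return result.text


if __name__ == "__main__":
    print(transcribe("dictation.wav"))  # hypothetical recording
```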
Text Model
Uses the transcript and the instruction template to generate the response; models like gpt-4 and Llama 3.
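Continuing the first method, the sketch below shows the formatting step, again assuming the OpenAI SDK as the client; format_report and the template file are illustrative names, not part of VOXRAD's own API.

```python
from openai import OpenAI  # pip install openai


def format_report(transcript: str, instruction_template: str) -> str:
    """Restructure a raw transcript into a report using a text model."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            # The instruction template tells the model how to structure the report.
            {"role": "system", "content": instruction_template},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content


# Example: feed the whisper output from the previous step into the text model.
# report = format_report(transcribe("dictation.wav"), open("template.txt").read())
```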
Multimodal Model
Takes the user's recorded audio and the instruction template directly to generate the output; models like gemini-1.5-flash. Only the gemini-1.5-pro and gemini-1.5-flash multimodal models are supported, and support is experimental. GPT-4o will be supported once its audio API becomes available. As these are remotely hosted LLMs, they should not be used for any sensitive data.
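For the second method, here is a minimal sketch using Google's google-generativeai SDK (assumed for illustration). The single key is the "Multimodal Model" key from the settings, and upload_file sends the audio through the Gemini File API.

```python
import google.generativeai as genai  # pip install google-generativeai

genai.configure(api_key="YOUR_GEMINI_API_KEY")  # the single "Multimodal Model" key


def report_from_audio(audio_path: str, instruction_template: str) -> str:
    """Send recorded audio plus the instruction template to a multimodal model."""
    # Remotely hosted model: do not send sensitive audio, per the note above.
    audio_file = genai.upload_file(audio_path, mime_type="audio/wav")
    model = genai.GenerativeModel("gemini-1.5-flash")
    response = model.generate_content([instruction_template, audio_file])
    return response.text


# report = report_from_audio("dictation.wav", open("template.txt").read())
```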