ElevenLabsSpeechToText - ComfyUI Built-in Node Documentation

The ElevenLabs Speech to Text node transcribes audio files into text. It uses ElevenLabs’ API to convert spoken words into a written transcript, supporting features like automatic language detection, identifying different speakers, and tagging non-speech sounds like music or laughter.

Inputs

| Parameter | Description | Data Type | Required | Range |
| --- | --- | --- | --- | --- |
| `audio` | Audio to transcribe. | AUDIO | Yes | - |
| `model` | Model to use for transcription. Selecting this model reveals additional parameters. | COMBO | Yes | `"scribe_v2"` |
| `tag_audio_events` | Annotate sounds like (laughter), (music), etc. in transcript. This parameter is revealed when the `"scribe_v2"` model is selected. (default: False) | BOOLEAN | No | - |
| `diarize` | Annotate which speaker is talking. This parameter is revealed when the `"scribe_v2"` model is selected. (default: False) | BOOLEAN | No | - |
| `diarization_threshold` | Speaker separation sensitivity. Lower values are more sensitive to speaker changes. This parameter is revealed when the `"scribe_v2"` model is selected and `diarize` is enabled. (default: 0.22) | FLOAT | No | 0.1 - 0.4 |
| `temperature` | Randomness control. 0.0 uses model default. Higher values increase randomness. This parameter is revealed when the `"scribe_v2"` model is selected. (default: 0.0) | FLOAT | No | 0.0 - 2.0 |
| `timestamps_granularity` | Timing precision for transcript words. This parameter is revealed when the `"scribe_v2"` model is selected. (default: `"word"`) | COMBO | No | `"word"`, `"character"`, `"none"` |
| `language_code` | ISO-639-1 or ISO-639-3 language code (e.g., `en`, `es`, `fra`). Leave empty for automatic detection. (default: `""`) | STRING | No | - |
| `num_speakers` | Maximum number of speakers to predict. Set to 0 for automatic detection. (default: 0) | INT | No | 0 - 32 |
| `seed` | Seed for reproducibility (determinism not guaranteed). (default: 1) | INT | No | 0 - 2147483647 |

Note: The `num_speakers` parameter cannot be set to a value greater than 0 when the `diarize` option is enabled. You must either disable `diarize` or set `num_speakers` to 0.
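The constraint in the note, along with the documented ranges, can be checked before queuing a workflow. The following is a minimal sketch, not the node's own validation code: the helper name `check_speech_to_text_inputs` is hypothetical, and the only rules it encodes are the ones listed on this page.

```python
def check_speech_to_text_inputs(
    diarize=False,
    num_speakers=0,
    diarization_threshold=0.22,
    temperature=0.0,
):
    """Hypothetical helper: return a list of problems found in the given values.

    An empty list means the values fit the limits documented on this page.
    """
    problems = []

    # Documented rule: num_speakers must stay at 0 while diarize is enabled.
    if diarize and num_speakers > 0:
        problems.append("num_speakers must be 0 when diarize is enabled")

    # Documented ranges from the Inputs table.
    if not 0 <= num_speakers <= 32:
        problems.append("num_speakers must be between 0 and 32")
    if diarize and not 0.1 <= diarization_threshold <= 0.4:
        problems.append("diarization_threshold must be between 0.1 and 0.4")
    if not 0.0 <= temperature <= 2.0:
        problems.append("temperature must be between 0.0 and 2.0")

    return problems


# Usage: the defaults pass, while diarize combined with a speaker count does not.
print(check_speech_to_text_inputs())
print(check_speech_to_text_inputs(diarize=True, num_speakers=2))
```

Returning a list rather than raising on the first problem lets a caller report every out-of-range value at once.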

Outputs

| Output Name | Description | Data Type |
| --- | --- | --- |
| `text` | The transcribed text from the audio. | STRING |
| `language_code` | The detected language code of the audio. | STRING |
| `words_json` | A JSON-formatted string containing detailed word-level information, including timestamps and speaker labels if enabled. | STRING |
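Because `words_json` is a plain string, downstream code has to parse it before using the word-level data. The sketch below groups words by speaker. It assumes the string decodes to a JSON array of objects and that each object carries `text` and `speaker_id` keys; this page does not specify the schema, so treat those key names and the helper `words_by_speaker` as hypothetical and inspect an actual output first.

```python
import json


def words_by_speaker(words_json):
    """Hypothetical helper: group word texts by speaker label.

    Assumes words_json decodes to a list of dicts with 'text' and
    'speaker_id' keys (assumed names, not confirmed by this page).
    """
    grouped = {}
    for word in json.loads(words_json):
        # Fall back to a placeholder label when no speaker is attached,
        # e.g. when diarize was disabled.
        speaker = word.get("speaker_id") or "unknown"
        grouped.setdefault(speaker, []).append(word.get("text", ""))
    return grouped


# Usage with a hand-written sample in the assumed shape.
sample = json.dumps([
    {"text": "Hello", "start": 0.0, "end": 0.4, "speaker_id": "speaker_0"},
    {"text": "Hi", "start": 0.5, "end": 0.7, "speaker_id": "speaker_1"},
])
print(words_by_speaker(sample))
```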

This documentation was AI-generated. If you find any errors or have suggestions for improvement, please feel free to contribute!

Source fingerprint (SHA-256): 7eb5d72615aa8a9e4a8014e45b39cf83dc8d8432d7ce0dccba20489be80a5830
