The ElevenLabs Speech to Text node transcribes audio files into text. It uses ElevenLabs’ API to convert spoken words into a written transcript, supporting features like automatic language detection, identifying different speakers, and tagging non-speech sounds like music or laughter.

Inputs

| Parameter | Description | Data Type | Required | Range |
| --- | --- | --- | --- | --- |
| audio | Audio to transcribe. | AUDIO | Yes | - |
| model | Model to use for transcription. Selecting this model reveals additional parameters. | COMBO | Yes | "scribe_v2" |
| tag_audio_events | Annotate sounds like (laughter), (music), etc. in transcript. This parameter is revealed when the "scribe_v2" model is selected. (default: False) | BOOLEAN | No | - |
| diarize | Annotate which speaker is talking. This parameter is revealed when the "scribe_v2" model is selected. (default: False) | BOOLEAN | No | - |
| diarization_threshold | Speaker separation sensitivity. Lower values are more sensitive to speaker changes. This parameter is revealed when the "scribe_v2" model is selected and diarize is enabled. (default: 0.22) | FLOAT | No | 0.1 - 0.4 |
| temperature | Randomness control. 0.0 uses model default. Higher values increase randomness. This parameter is revealed when the "scribe_v2" model is selected. (default: 0.0) | FLOAT | No | 0.0 - 2.0 |
| timestamps_granularity | Timing precision for transcript words. This parameter is revealed when the "scribe_v2" model is selected. (default: "word") | COMBO | No | "word", "character", "none" |
| language_code | ISO-639-1 or ISO-639-3 language code (e.g., ‘en’, ‘es’, ‘fra’). Leave empty for automatic detection. (default: "") | STRING | No | - |
| num_speakers | Maximum number of speakers to predict. Set to 0 for automatic detection. (default: 0) | INT | No | 0 - 32 |
| seed | Seed for reproducibility (determinism not guaranteed). (default: 1) | INT | No | 0 - 2147483647 |

Note: The num_speakers parameter cannot be set to a value greater than 0 when the diarize option is enabled. You must either disable diarize or set num_speakers to 0.
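
The ranges listed for the inputs and the diarize/num_speakers rule in the note can be checked before a workflow is queued. The following is a minimal sketch, not the node's actual implementation: the `SpeechToTextParams` class and the `validate` function are hypothetical names introduced here for illustration, and only the defaults, ranges, and the one constraint stated on this page are encoded.

```python
from dataclasses import dataclass


@dataclass
class SpeechToTextParams:
    """Hypothetical container mirroring the documented inputs and defaults."""
    diarize: bool = False
    diarization_threshold: float = 0.22
    temperature: float = 0.0
    timestamps_granularity: str = "word"
    num_speakers: int = 0
    seed: int = 1


def validate(params):
    """Return a list of problems; an empty list means the values are consistent."""
    problems = []
    # The threshold is only shown (and only meaningful) when diarize is enabled.
    if params.diarize and not 0.1 <= params.diarization_threshold <= 0.4:
        problems.append("diarization_threshold must be between 0.1 and 0.4")
    if not 0.0 <= params.temperature <= 2.0:
        problems.append("temperature must be between 0.0 and 2.0")
    if params.timestamps_granularity not in ("word", "character", "none"):
        problems.append("timestamps_granularity must be word, character, or none")
    if not 0 <= params.num_speakers <= 32:
        problems.append("num_speakers must be between 0 and 32")
    if not 0 <= params.seed <= 2147483647:
        problems.append("seed must be between 0 and 2147483647")
    # Rule from the note: a fixed speaker count cannot be combined with diarize.
    if params.diarize and params.num_speakers > 0:
        problems.append("num_speakers must be 0 when diarize is enabled")
    return problems


if __name__ == "__main__":
    print(validate(SpeechToTextParams()))
    print(validate(SpeechToTextParams(diarize=True, num_speakers=2)))
```

Returning a list of messages rather than raising on the first problem lets a caller report every invalid value at once.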

Outputs

| Output Name | Description | Data Type |
| --- | --- | --- |
| text | The transcribed text from the audio. | STRING |
| language_code | The detected language code of the audio. | STRING |
| words_json | A JSON-formatted string containing detailed word-level information, including timestamps and speaker labels if enabled. | STRING |

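
Because words_json is a plain STRING, downstream code has to decode it before using the timestamps or speaker labels. This page does not document the schema, so the sketch below rests on assumptions: it supposes the string decodes to a list of objects with "text", "type", and "speaker_id" keys, and that whitespace may appear as separate entries with type "spacing". The `speaker_turns` helper is a hypothetical name; inspect the node's real output and adjust the keys before relying on it.

```python
import json


def speaker_turns(words_json):
    """Group consecutive entries by speaker into (speaker, text) tuples.

    Assumes a list of objects with "text", "type", and "speaker_id" keys;
    these key names are assumptions, not confirmed by the documentation.
    """
    entries = json.loads(words_json)
    turns = []  # each item is [speaker, [tokens]]
    for entry in entries:
        # Skip whitespace-only entries; tokens are re-joined with spaces below.
        if entry.get("type") == "spacing":
            continue
        speaker = entry.get("speaker_id") or "unknown"
        # Start a new turn whenever the speaker changes.
        if not turns or turns[-1][0] != speaker:
            turns.append([speaker, []])
        turns[-1][1].append(entry.get("text", ""))
    return [(speaker, " ".join(tokens)) for speaker, tokens in turns]


if __name__ == "__main__":
    # Made-up sample data in the assumed shape, for illustration only.
    sample = json.dumps([
        {"text": "Hello", "type": "word", "start": 0.0, "end": 0.4, "speaker_id": "speaker_0"},
        {"text": " ", "type": "spacing", "start": 0.4, "end": 0.5, "speaker_id": "speaker_0"},
        {"text": "there", "type": "word", "start": 0.5, "end": 0.9, "speaker_id": "speaker_0"},
        {"text": "Hi", "type": "word", "start": 1.2, "end": 1.4, "speaker_id": "speaker_1"},
    ])
    for speaker, text in speaker_turns(sample):
        print(f"{speaker}: {text}")
```

If diarize is disabled, entries without a speaker label fall into a single "unknown" turn.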
