# ElevenLabsSpeechToText - ComfyUI Built-in Node Documentation

> Complete documentation for the ElevenLabsSpeechToText node in ComfyUI. Learn its inputs, outputs, parameters and usage.

The ElevenLabs Speech to Text node transcribes audio into text using the ElevenLabs API. It supports automatic language detection, speaker diarization (labeling which speaker is talking), and tagging of non-speech sounds such as music or laughter.

## Inputs

| Parameter                | Data Type | Required | Range                                     | Description                                                                                                                                                                                       |
| ------------------------ | --------- | -------- | ----------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `audio`                  | AUDIO     | Yes      | -                                         | Audio to transcribe.                                                                                                                                                                              |
| `model`                  | COMBO     | Yes      | `"scribe_v2"`                             | Model to use for transcription. Selecting this model reveals additional parameters.                                                                                                               |
| `tag_audio_events`       | BOOLEAN   | No       | -                                         | Annotate non-speech sounds such as (laughter) or (music) in the transcript. This parameter is revealed when the `"scribe_v2"` model is selected. (default: False)                                 |
| `diarize`                | BOOLEAN   | No       | -                                         | Annotate which speaker is talking. This parameter is revealed when the `"scribe_v2"` model is selected. (default: False)                                                                          |
| `diarization_threshold`  | FLOAT     | No       | 0.1 - 0.4                                 | Speaker separation sensitivity. Lower values are more sensitive to speaker changes. This parameter is revealed when the `"scribe_v2"` model is selected and `diarize` is enabled. (default: 0.22) |
| `temperature`            | FLOAT     | No       | 0.0 - 2.0                                 | Randomness control. 0.0 uses model default. Higher values increase randomness. This parameter is revealed when the `"scribe_v2"` model is selected. (default: 0.0)                                |
| `timestamps_granularity` | COMBO     | No       | `"word"`<br />`"character"`<br />`"none"` | Timing precision for transcript words. This parameter is revealed when the `"scribe_v2"` model is selected. (default: "word")                                                                     |
| `language_code`          | STRING    | No       | -                                         | ISO-639-1 or ISO-639-3 language code (e.g., 'en', 'es', 'fra'). Leave empty for automatic detection. (default: "")                                                                                |
| `num_speakers`           | INT       | No       | 0 - 32                                    | Maximum number of speakers to predict. Set to 0 for automatic detection. (default: 0)                                                                                                             |
| `seed`                   | INT       | No       | 0 - 2147483647                            | Seed for reproducibility (determinism not guaranteed). (default: 1)                                                                                                                               |

**Note:** The `num_speakers` parameter cannot be set to a value greater than 0 when the `diarize` option is enabled. You must either disable `diarize` or set `num_speakers` to 0.
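
The ranges in the Inputs table and the constraint in the note can be checked before queuing a workflow. The following is a minimal sketch, not the node's own validation code: `SpeechToTextSettings` and `check_settings` are hypothetical names introduced here for illustration, and only the limits and defaults are taken from the table above.

```python
from dataclasses import dataclass

# Allowed values copied from the Inputs table.
GRANULARITIES = ("word", "character", "none")


@dataclass
class SpeechToTextSettings:
    """Hypothetical container for the node's inputs (not part of ComfyUI)."""
    diarize: bool = False
    diarization_threshold: float = 0.22
    temperature: float = 0.0
    timestamps_granularity: str = "word"
    num_speakers: int = 0
    seed: int = 1


def check_settings(s: SpeechToTextSettings) -> list:
    """Return a list of problems; an empty list means the settings look valid."""
    problems = []
    if not 0.1 <= s.diarization_threshold <= 0.4:
        problems.append("diarization_threshold must be within 0.1 - 0.4")
    if not 0.0 <= s.temperature <= 2.0:
        problems.append("temperature must be within 0.0 - 2.0")
    if s.timestamps_granularity not in GRANULARITIES:
        problems.append("timestamps_granularity must be word, character or none")
    if not 0 <= s.num_speakers <= 32:
        problems.append("num_speakers must be within 0 - 32")
    if not 0 <= s.seed <= 2147483647:
        problems.append("seed must be within 0 - 2147483647")
    # Constraint from the note: a speaker count and diarize are mutually exclusive.
    if s.diarize and s.num_speakers > 0:
        problems.append("num_speakers must be 0 when diarize is enabled")
    return problems
```

For example, `check_settings(SpeechToTextSettings(diarize=True, num_speakers=2))` returns a single problem describing the conflict, while the default settings return an empty list.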

## Outputs

| Output Name     | Data Type | Description                                                                                                             |
| --------------- | --------- | ----------------------------------------------------------------------------------------------------------------------- |
| `text`          | STRING    | The transcribed text from the audio.                                                                                    |
| `language_code` | STRING    | The detected language code of the audio.                                                                                |
| `words_json`    | STRING    | A JSON-formatted string containing detailed word-level information, including timestamps and speaker labels if enabled. |
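
Downstream code can parse the `words_json` output to rebuild speaker turns. The sketch below rests on an assumption this page does not confirm: that the string decodes to a list of objects with `text` and `speaker_id` keys, with whitespace delivered as separate entries. Inspect the actual output and adjust the key names if they differ. `speaker_turns` and `sample` are hypothetical names used only for illustration.

```python
import json
from itertools import groupby
from typing import List, Optional, Tuple


def speaker_turns(words_json: str) -> List[Tuple[Optional[str], str]]:
    """Group consecutive entries by speaker and join their text into turns."""
    words = json.loads(words_json)
    turns = []
    # groupby only merges adjacent entries, so the order of turns is preserved.
    for speaker, group in groupby(words, key=lambda w: w.get("speaker_id")):
        text = "".join(w.get("text", "") for w in group).strip()
        if text:
            turns.append((speaker, text))
    return turns


# Assumed shape of the data, written by hand for this example.
sample = json.dumps([
    {"text": "Hello", "start": 0.0, "end": 0.4, "speaker_id": "speaker_0"},
    {"text": " ", "start": 0.4, "end": 0.5, "speaker_id": "speaker_0"},
    {"text": "there", "start": 0.5, "end": 0.9, "speaker_id": "speaker_0"},
    {"text": "Hi", "start": 1.2, "end": 1.4, "speaker_id": "speaker_1"},
])

for speaker, text in speaker_turns(sample):
    print(f"{speaker}: {text}")
```

If `diarize` is disabled, the speaker key may be missing or null. In that case the function returns a single turn whose speaker is `None`.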
