This node prepares data for training by encoding images and text. It takes a list of images and an optional list of text captions, then uses a VAE model to convert each image into a latent representation and a CLIP model to convert each caption into conditioning data. The resulting latents and conditioning are output as two lists, paired by position and ready for use in training workflows.

Inputs

| Parameter | Description | Data Type | Required | Range |
|-----------|-------------|-----------|----------|-------|
| images | List of images to encode. | IMAGE | Yes | N/A |
| vae | VAE model for encoding images to latents. | VAE | Yes | N/A |
| clip | CLIP model for encoding text to conditioning. | CLIP | Yes | N/A |
| texts | List of text captions. Can be length n (matching images), 1 (repeated for all), or omitted (uses empty string). | STRING | No | N/A |

Parameter Constraints:
  • The number of items in the texts list must be 0, 1, or exactly match the number of items in the images list. If it is 0, an empty string is used for all images. If it is 1, that single text is repeated for all images.
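
The caption-length rule above can be sketched in plain Python. This is only an illustration of the 0 / 1 / n rule as described, not the node's actual source; the function name `normalize_texts` and its error message are hypothetical.

```python
from typing import List, Optional, Sequence


def normalize_texts(texts: Optional[Sequence[str]], num_images: int) -> List[str]:
    """Return exactly one caption per image (hypothetical helper, not the node's code)."""
    if not texts:
        # Omitted or empty list: every image gets an empty caption.
        return [""] * num_images
    if len(texts) == 1:
        # A single caption is repeated for all images.
        return [texts[0]] * num_images
    if len(texts) == num_images:
        # One caption per image, used as given.
        return list(texts)
    raise ValueError(
        f"texts must have 0, 1, or {num_images} items, got {len(texts)}"
    )
```

For example, `normalize_texts(["a photo"], 3)` yields three copies of `"a photo"`, while passing two captions for three images raises `ValueError`.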

Outputs

| Output Name | Description | Data Type |
|-------------|-------------|-----------|
| latents | List of latent dicts. | LATENT |
| conditioning | List of conditioning lists. | CONDITIONING |
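
To make the shape of the two outputs concrete, here is a minimal sketch of how the paired lists line up. It makes several assumptions: `encode_pairs`, `encode_image`, and `encode_text` are hypothetical stand-ins for the node and its VAE and CLIP encoders, and wrapping each latent in a dict under a `"samples"` key is assumed from common ComfyUI convention rather than stated on this page.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple


def encode_pairs(
    images: Sequence[Any],
    captions: Sequence[str],
    encode_image: Callable[[Any], Any],   # stand-in for the VAE encoder (assumption)
    encode_text: Callable[[str], list],   # stand-in for the CLIP encoder (assumption)
) -> Tuple[List[Dict[str, Any]], List[list]]:
    """Return (latents, conditioning) as two lists paired by index."""
    if len(images) != len(captions):
        raise ValueError("expected one caption per image")
    # One latent dict per image; the "samples" key is an assumed convention.
    latents = [{"samples": encode_image(img)} for img in images]
    # One conditioning list per caption, in the same order as the images.
    conditioning = [encode_text(text) for text in captions]
    return latents, conditioning
```

With this layout, `latents[i]` and `conditioning[i]` always describe the same training example, so a downstream training step can iterate over both lists together with `zip`.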

Source fingerprint (SHA-256): 72f1686aa9da9d50b1948040c323c7e944d4a5c1f4cd2ec5e0987d998c20ea43