[widget] Aesthetic Biases of VQGAN and CLIP Checkpoints

The widget below illustrates how images generated in “VQGAN” mode are affected by the choice of VQGAN model and CLIP perceptor.

Press the “▷” icon to begin the animation.

The first run with any particular set of settings will probably show an empty image because the widget is janky and downloads only what it needs on the fly. What can I say: I’m an ML engineer, not a web developer.

What is “VQGAN” mode?

VQGAN is a method for representing images implicitly, via a latent representation. The dataset a VQGAN was trained on constrains the kinds of images it can generate, so different pre-trained VQGANs each have their own characteristic look, on top of the more general “VQGAN” look these models tend to share.

The models used to score image-text similarity (usually a CLIP model) are likewise shaped by the dataset they were trained on. Additionally, CLIP perceptors come in a few different structural configurations (ResNet vs. transformer image encoders, fewer vs. more parameters, etc.), and these choices affect the kinds of images a given perceptor will guide the VQGAN towards.
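
To make the scoring step concrete, here is a minimal sketch of measuring image-text similarity with a CLIP checkpoint, using the openai/CLIP package directly (pytti loads its perceptors through the mmc library instead, and "frame.png" is just a placeholder image):

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# swap in "RN50", "RN50x4", "ViT-B/16", etc. to compare perceptors
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("frame.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a painting of a cold wintery landscape, by Rembrandt"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # cosine similarity between the image and the prompt
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    similarity = (image_features @ text_features.T).item()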

Finally, all of these components can interact. And really, the only way to understand the “look” of these models is to play with them and see for yourself. That’s what this page is for :)

Description of Settings in Widget

  • vqgan_model: The “name” pytti uses for a particular pre-trained VQGAN. The name is derived from the dataset used to train the model.

  • mmc_models: The identifier of the (CLIP) perceptor used by the mmc library, which pytti uses to load these models.
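
For orientation, a settings fragment exercising these two knobs might look roughly like the following. This is a hypothetical sketch shown as a plain dict, with key names taken from the widget and settings listed on this page; the exact schema of a real pytti config may differ.

settings = {
    "image_model": "VQGAN",
    "vqgan_model": "sflckr",     # which pre-trained VQGAN checkpoint to use
    "use_mmc": True,             # load the CLIP perceptor via the mmc library
    "mmc_models": "ViT-B32",     # which CLIP checkpoint scores the images
}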

Widget

#import re
from pathlib import Path

import numpy as np
import pandas as pd
import panel as pn

pn.extension()

outputs_root = Path('images_out')
folder_prefix = 'exp_vqgan_base_perceptors' #'permutations_limited_palette_2D'
folders = list(outputs_root.glob(f'{folder_prefix}_*'))

def format_val(v):
    # coerce numeric strings to int/float where possible; leave other values as-is
    try:
        v = float(v)
        if int(v) == v:
            v = int(v)
    except (ValueError, TypeError):
        pass
    return v

# TODO: fix this regex-based parser and use it instead of the simple split below
def parse_folder_name(folder):
    #metadata_string = folder.name[1+len(folder_prefix):]
    #pattern = r"_?([a-zA-Z_]+)-([0-9.]+)"
    #matches = re.findall(pattern, metadata_string)
    #d_ = {k:format_val(v) for k,v in matches}
    # folder names encode the models used after a double underscore,
    # e.g. "<prefix>__<vqgan>_<perceptor>"; flag each model present with a 1
    _, metadata_string = folder.name.split('__')
    d_ = {k:1 for k in metadata_string.split('_')}
    d_['fpath'] = folder
    d_['n_images'] = len(list(folder.glob('*.png')))
    return d_

#let's just make each model a column
df_meta = pd.DataFrame([parse_folder_name(f) for f in folders]).fillna(0)

variant_names = [v for v in df_meta.columns.tolist() if v not in ['fpath']]
variant_ranges = {v:df_meta[v].unique() for v in variant_names}
for v in variant_ranges.values():
    v.sort()

###########################

url_prefix = "https://raw.githubusercontent.com/dmarx/pytti-settings-test/main/images_out/"

image_paths = [str(p) for p in outputs_root.glob('**/*.png')]
d_image_urls = {im_path:im_path.replace('images_out/', url_prefix) for im_path in image_paths}

###########################

vqgan_selector = pn.widgets.Select(
    name='vqgan_model', 
    options=[
        'imagenet',
        'coco',
        'wikiart',
        'openimages',
        'sflckr'
    ], 
    value='sflckr',
)

perceptor_selector = pn.widgets.Select(
        name='mmc_models',
        options=[
            'RN101',
            'RN50',
            'RN50x4',
            'ViT-B16',
            'ViT-B32'
        ]
)

n_imgs_per_group = 40
step_selector = pn.widgets.Player(interval=100, name='step', start=1, end=n_imgs_per_group, step=1, value=1, loop_policy='reflect')

@pn.interact(
    vqgan_model=vqgan_selector,
    mmc_models=perceptor_selector,
    i=step_selector,
)
def display_images(
    vqgan_model,
    mmc_models,
    i,
):
    # keep only the experiment folder whose name contains both the chosen
    # VQGAN checkpoint and the chosen CLIP perceptor
    idx = np.ones(len(df_meta), dtype=bool)
    idx &= df_meta[mmc_models] > 0
    idx &= df_meta[vqgan_model] > 0

    folder = df_meta[idx]['fpath'].values[0]
    # frames are saved as <folder_name>_<step>.png inside each experiment folder
    im_path = str(folder / f"{folder.name}_{i}.png")
    im_url = d_image_urls[im_path]
    return pn.pane.HTML(f'<img src="{im_url}" width="700">', width=700, height=350, sizing_mode='fixed')

pn.panel(display_images).embed(max_opts=n_imgs_per_group, max_states=999999999)
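
A note on the last line: calling .embed() on the panel bakes a limited set of widget states directly into the exported page, which is what lets the animation player work on a static site without a live Python server. The max_opts and max_states arguments cap how many widget option combinations get pre-rendered.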

Settings shared across animations

scenes: "a photograph of a bright and beautiful spring day, by Trey Ratcliff || a painting of a cold wintery landscape, by Rembrandt "
scene_suffix: " | text:-1:-.9 | watermark:-1:-.9"

steps_per_frame: 50
save_every: 50
steps_per_scene: 1000
interpolation_steps: 500

animation_mode: "2D"
translate_x: -1
zoom_x_2d: 3
zoom_y_2d: 3

cutouts: 60
cut_pow: 1

seed: 12345

pixel_size: 1
height: 128
width: 256

###########################
# still need explanations #
###########################

init_image: GasWorksPark3.jpg
direct_stabilization_weight: 0.3
reencode_each_frame: false
reset_lr_each_frame: true
image_model: VQGAN
use_mmc: true

Detailed explanation of shared settings

(WIP)

scenes: "a photograph of a bright and beautiful spring day, by Trey Ratcliff || a painting of a cold wintery landscape, by Rembrandt "

We have two scenes (separated by ||) with one prompt each.

scene_suffix: " | text:-1:-.9 | watermark:-1:-.9"

We add prompts with negative weights (and ‘stop’ weights: prompt:weight:stop) to try to discourage generation of specific artifacts. Putting these prompts in the scene_suffix field is a shorthand for appending them to every scene. I find it also helps keep the settings a little more neatly organized by reducing clutter in the scenes field.
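
A rough sketch of the expansion this implies (illustrative only; this is not pytti’s actual parsing code):

scenes = ("a photograph of a bright and beautiful spring day, by Trey Ratcliff || "
          "a painting of a cold wintery landscape, by Rembrandt ")
scene_suffix = " | text:-1:-.9 | watermark:-1:-.9"

# every scene gets the suffix prompts appended to it
effective_scenes = [scene.strip() + scene_suffix for scene in scenes.split("||")]
for scene in effective_scenes:
    print(scene)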

steps_per_frame: 50
save_every: 50
steps_per_scene: 1000

Pytti will take 50 optimization steps for each frame (i.e. image) of the animation.

We have two scenes: 1000 steps_per_scene / 50 steps_per_frame = 20 frames per scene = 40 frames total will be generated.
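
The same bookkeeping in code form, just mirroring the numbers above:

steps_per_scene = 1000
steps_per_frame = 50
n_scenes = 2

frames_per_scene = steps_per_scene // steps_per_frame   # 20
total_frames = n_scenes * frames_per_scene               # 40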

interpolation_steps: 500

A range of 500 steps will be treated as a kind of “overlap” between the two scenes, to ease the transition from one scene to the next. This means that for each scene we’ll have 1000 - 500/2 = 750 steps = 15 frames guided only by the prompt we specified for that scene, and 5 frames where the guiding prompts are constructed by interpolating (mixing) between the prompts of the two scenes. Concretely (see the sketch after this list):

  • first 15 frames: only the prompt for the first scene is used

  • next 5 frames: we use the prompts from both scenes, weighting the first scene more heavily

  • next 5 frames: we use the prompts from both scenes, weighting the second scene more heavily

  • last 15 frames: only the prompt for the second scene is used.
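
Continuing the arithmetic sketch above, the interpolation window splits each scene’s 20 frames like this:

interpolation_steps = 500

# half of the interpolation window is "borrowed" from each scene
mixed_steps_per_scene = interpolation_steps // 2                   # 250
pure_steps_per_scene = steps_per_scene - mixed_steps_per_scene     # 750

pure_frames_per_scene = pure_steps_per_scene // steps_per_frame    # 15
mixed_frames_per_scene = mixed_steps_per_scene // steps_per_frame  # 5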

image_model: VQGAN

We’re using the VQGAN mode described above, i.e. using a model designed to generate feasible images as a kind of constraint on the image generation process.
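
In spirit, a single optimization step in this mode looks something like the sketch below. This is purely illustrative: vqgan and perceptor stand in for a pre-trained VQGAN decoder and a CLIP model, and none of this is pytti’s actual code.

def vqgan_clip_step(z, vqgan, perceptor, text_features, optimizer):
    # z is the VQGAN latent being optimized; the model weights stay frozen
    optimizer.zero_grad()
    image = vqgan.decode(z)                       # latent -> image
    image_features = perceptor.encode_image(image)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    # negative cosine similarity to the prompt embedding: lower loss = closer match
    loss = -(image_features @ text_features.T).mean()
    loss.backward()
    optimizer.step()
    return loss.item()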

animation_mode: "2D"
translate_X: -1
zoom_x_2d: 3
zoom_y_2d: 3

After each frame is generated, we will initialize the next frame by scaling up (zooming into) the image a small amount, then shifting it (translating) left (the negative direction along the x axis) a tiny bit. Even a tiny bit of “motion” tends to make for more interesting animations; otherwise the optimization process converges and the image stays relatively fixed.
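
A hedged sketch of that re-initialization using PIL; the exact units and sign conventions pytti uses for zoom_x_2d / zoom_y_2d / translate_x may differ from what is assumed here:

from PIL import Image

def reinit_next_frame(frame, zoom_percent=3, translate_x=-1):
    # assumption: zoom is treated as a percent scale-up and translate_x as pixels
    w, h = frame.size
    new_w = int(w * (1 + zoom_percent / 100))
    new_h = int(h * (1 + zoom_percent / 100))
    scaled = frame.resize((new_w, new_h), Image.LANCZOS)
    # crop back to the original size around the center, shifted along x
    left = (new_w - w) // 2 - translate_x
    top = (new_h - h) // 2
    return scaled.crop((left, top, left + w, top + h))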

cutouts: 60
cut_pow: 1

For each optimization step, we will take 60 random crops from the image to show the perceptor. cut_pow controls the size of these cutouts: 1 is generally a good default, and smaller values create bigger cutouts. Generally, more cutouts = nicer images. Setting the number of cutouts too low can cause the image to segment itself into regions: you can observe this phenomenon manifesting towards the end of many of the animations generated in this experiment. In addition to turning up the number of cutouts, this could also potentially be fixed by setting cut_pow lower, to ask the perceptor to score larger regions at a time.
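
A sketch in the spirit of the “MakeCutouts” approach used in many CLIP-guided notebooks (pytti’s actual cutout implementation may differ in its details):

import torch
import torch.nn.functional as F

def make_cutouts(img, n_cutouts=60, cut_pow=1.0, cut_size=224):
    # img: (1, 3, H, W) tensor; cut_size is the perceptor's input resolution
    _, _, h, w = img.shape
    max_size = min(h, w)
    min_size = min(max_size, cut_size)
    cutouts = []
    for _ in range(n_cutouts):
        # cut_pow < 1 biases draws toward larger crops, > 1 toward smaller ones
        size = int(torch.rand(()).item() ** cut_pow * (max_size - min_size) + min_size)
        x = int(torch.randint(0, w - size + 1, ()).item())
        y = int(torch.randint(0, h - size + 1, ()).item())
        crop = img[:, :, y:y + size, x:x + size]
        cutouts.append(F.interpolate(crop, size=(cut_size, cut_size),
                                     mode='bilinear', align_corners=False))
    return torch.cat(cutouts)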

seed: 12345

If a seed is not specified, one will be generated randomly. This value is used to initialize the random number generator: specifying a seed promotes deterministic (repeatable) behavior. This is an especially useful parameter to set for comparison studies like this.
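
Under the hood, “setting a seed” amounts to something like the following (pytti handles this internally when seed is specified; shown here only to illustrate the idea):

import random
import numpy as np
import torch

seed = 12345
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)  # also seeds CUDA RNGs on current versions of PyTorch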