[widget] Video Source Stabilization (part 1)

The widget below illustrates how images generated using animation_mode: Video Source are affected by certain “stabilization” options.

Press the “▷” icon to begin the animation.

The first run with any particular set of settings will probably show an empty image because the widget is janky and downloads only what it needs on the fly. What can I say: I’m an ML engineer, not a web developer.

What is “Video Source” animation mode?

PyTTI generates images by iterative updates. This process can be initialized in a variety of ways, and depending on how certain settings are configured, the initial state can have a very significant impact on the final result. For example, if we set the number of steps or the learning rate very low, the final result might be barely modified from the initial state. PyTTI’s default behavior is to initialize this process using random noise (i.e. an image of fuzzy static). If we instead provide an image to use as the starting state, the “image generation” becomes more of an “image manipulation”. A video is just a sequence of images, so we can use PyTTI as a tool for manipulating an input video sequence, similar to how it can be used to manipulate a single input image.
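
To make the role of the initial state concrete, here is a toy numerical sketch. This is not PyTTI’s actual optimizer; the target array is just a stand-in for whatever the prompts are asking for. The point is that with only a few update steps, the output barely moves away from whichever init it started from.

import numpy as np

rng = np.random.default_rng(0)
target = rng.random((64, 64))       # stand-in for "what the prompts want"

def generate(init_image, steps, lr=0.1):
    # Toy update loop: nudge the image a little toward the target each step.
    image = init_image.copy()
    for _ in range(steps):
        image -= lr * (image - target)
    return image

noise_init = rng.random((64, 64))   # PyTTI's default: start from random noise
frame_init = np.zeros((64, 64))     # or start from a video frame / init image

# With only a couple of steps, each result stays close to its own init:
print(np.abs(generate(noise_init, steps=2) - noise_init).mean())
print(np.abs(generate(frame_init, steps=2) - frame_init).mean())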

Generating a sequence of images for an animation comes with some additional considerations. In particular, we usually want some control over frame-to-frame coherence. Using adjacent video frames as the init images for adjacent frames of an animation at least guarantees some structural coherence in the image layout, but otherwise the frames are generated independently of each other. A single frame generated this way will probably look fine in isolation, but as part of an animation sequence it can produce undesirable flickering, as objects in the image change appearance without regard to how they looked in the previous frame.

To resolve this, PyTTI provides a variety of mechanisms for encouraging each generated frame to conform to attributes of the input video, of previously generated animation frames, or both.

The following widget uses the VQGAN image model. You can absolutely use other image models for video source animations, but VQGAN is generally what people are looking for here. The animations below contain some artifacts as a consequence of the low output resolution, which was kept low to generate the demonstration images faster, so keep in mind that VQGAN outputs don’t need to be as “blocky” as those illustrated here.

Description of Settings in Widget

  • reencode_each_frame: Use each video frame as an init_image instead of warping each output frame into the init for the next. Cuts will still be detected and trigger a reencode.

  • direct_stabilization_weight: Use the current frame of the video as a direct image prompt.

  • semantic_stabilization_weight: Use the current frame of the video as a semantic image prompt (see the example settings fragment below).
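
For example, a settings fragment that enables all three of these options at once might look like the following. The weight values here are arbitrary illustrations, not recommendations.

animation_mode: "Video Source"
reencode_each_frame: False
direct_stabilization_weight: 1
semantic_stabilization_weight: 1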

Widget

import re
from pathlib import Path

from IPython.display import display, clear_output, Image, Video
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import panel as pn

pn.extension()

#########

outputs_root = Path('images_out')
folder_prefix = 'exp_video_basic_stability_modes'
folders = list(outputs_root.glob(f'{folder_prefix}_*'))


def format_val(v):
    """Coerce numeric strings (e.g. '0.5', '1') to numbers; leave everything else as-is."""
    try:
        v = float(v)
        if int(v) == v:
            v = int(v)
    except (TypeError, ValueError):
        pass
    return v

def parse_folder_name(folder):
    """Recover the experiment settings encoded in an output folder's name."""
    metadata_string = folder.name[1 + len(folder_prefix):]
    pattern = r"_?([a-zA-Z_]+)-(True|False|[0-9.]+)"
    matches = re.findall(pattern, metadata_string)
    d_ = {k: format_val(v) for k, v in matches}
    d_['fpath'] = folder
    d_['n_images'] = len(list(folder.glob('*.png')))
    return d_

df_meta = pd.DataFrame([parse_folder_name(f) for f in folders])

variant_names = [v for v in df_meta.columns.tolist() if v not in ['fpath']]
variant_ranges = {v:df_meta[v].unique() for v in variant_names}
for v in variant_ranges.values():
    v.sort()


##########################################

n_imgs_per_group = 20

def setting_name_shorthand(setting_name):
    return ''.join([tok[0] for tok in setting_name.split('_')])

decoded_setting_name = {
    'ref': 'reencode_each_frame',
    'dsw': 'direct_stabilization_weight',
    'ssw': 'semantic_stabilization_weight',
}

# One DiscreteSlider per experiment setting ('n_images' is bookkeeping, not a setting),
# plus a Player widget for stepping/animating through the generated frames.
kargs = {k: pn.widgets.DiscreteSlider(name=decoded_setting_name[k], options=list(v), value=v[0]) for k, v in variant_ranges.items() if k != 'n_images'}
kargs['i'] = pn.widgets.Player(interval=300, name='step', start=1, end=n_imgs_per_group, step=1, value=1, loop_policy='reflect')


# Map local image paths to raw GitHub URLs so the embedded widget can load
# the images without a local copy of the outputs.
url_prefix = "https://raw.githubusercontent.com/dmarx/pytti-settings-test/main/images_out/"
image_paths = [str(p) for p in outputs_root.glob('**/*.png')]
d_image_urls = {im_path: im_path.replace('images_out/', url_prefix) for im_path in image_paths}

##########

@pn.interact(
    **kargs
)
def display_images(
    ref,
    dsw,
    ssw,
    i,
):
    folder = df_meta[
        (ref == df_meta['ref']) &
        (dsw == df_meta['dsw']) &
        (ssw == df_meta['ssw'])
    ]['fpath'].values[0]
    im_path = str(folder / f"{folder.name}_{i}.png")
    #im_url = im_path
    im_url = d_image_urls[im_path]
    return pn.pane.HTML(f'<img src="{im_url}" width="700">', width=700, height=350, sizing_mode='fixed')

pn.panel(display_images, height=1000).embed(max_opts=n_imgs_per_group, max_states=999999999)

Settings shared across animations

scenes: "a photograph of a bright and beautiful spring day, by Trey Ratcliff"
scene_suffix: " | text:-1:-.9 | watermark:-1:-.9"

animation_mode: "Video Source"
video_path: "/home/dmarx/proj/pytti-book/pytti-core/src/pytti/assets/HebyMorgongava_512kb.mp4"
frames_per_second: 15
backups: 3

steps_per_frame: 50
save_every: 50
steps_per_scene: 1000

image_model: "VQGAN"

cutouts: 40
cut_pow: 1

pixel_size: 1
height: 512
width: 1024

seed: 12345

Detailed explanation of shared settings

(WIP)

scenes: "a photograph of a bright and beautiful spring day, by Trey Ratcliff"
scene_suffix: " | text:-1:-.9 | watermark:-1:-.9"

Guiding text prompts. The scene_suffix is appended to the scene text; its negatively weighted text and watermark prompts nudge the optimization away from rendering text or watermark-like artifacts.

animation_mode: "Video Source"
video_path: "/home/dmarx/proj/pytti-book/pytti-core/src/pytti/assets/HebyMorgongava_512kb.mp4"

It’s generally a good idea to specify the path to files using an “absolute” path (starting from the root folder of the file system, in this case “/”) rather than a “relative” path (‘relative’ with respect to the current folder). This is because, depending on how we run PyTTI, it may actually change the current working directory. This is one of the many headaches that come with Hydra, which powers PyTTI’s CLI and config system.
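
If you’re not sure what the absolute path to a file is, Python’s pathlib will tell you (the filename here is just a placeholder):

from pathlib import Path

# Resolve a relative path to an absolute one before pasting it into your config.
print(Path("my_video.mp4").resolve())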

frames_per_second: 15

The video source file will be read in using ffmpeg, which will resample it from its original frame rate to 15 FPS.
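
To see roughly what that resampling looks like outside of PyTTI (this is not PyTTI’s internal code, just a plain ffmpeg invocation with placeholder filenames):

import subprocess

# Re-encode a video at 15 frames per second using ffmpeg's -r output flag.
subprocess.run(
    ["ffmpeg", "-i", "input.mp4", "-r", "15", "output_15fps.mp4"],
    check=True,
)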

backups: 3

This is a concern that should totally be abstracted away from the user and I’m sorry I haven’t taken care of it already. If you get errors saying something like pytti can’t find a file named ...*.bak, try setting backups to 0 or incrementing the number of backups until the error goes away. Let’s just leave it at that for now.

steps_per_frame: 50
save_every: 50
steps_per_scene: 1000

PyTTI will take 50 optimization steps for each frame (i.e. each image) of the animation.

We have a single scene: 1000 steps_per_scene / 50 steps_per_frame = 20 frames will be generated in total.

At 15 FPS, we’ll be manipulating roughly 1.3 seconds of video footage. If the input video is shorter than the output duration we just computed, the animation will simply end when we run out of input video frames.
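
The same arithmetic as a quick sanity check in Python:

steps_per_scene = 1000
steps_per_frame = 50
frames_per_second = 15

n_frames = steps_per_scene // steps_per_frame   # 20 frames
duration = n_frames / frames_per_second         # ~1.33 seconds of output
print(n_frames, round(duration, 2))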

To apply PyTTI to an entire input video: set steps_per_scene to an arbitrarily high value.

image_model: "VQGAN"

We choose the VQGAN model here because it is essentially a shortcut to photorealistic outputs.

cutouts: 40
cut_pow: 1

For each optimization step, we will take 40 random crops (“cutouts”) from the image to show the perceptor. cut_pow controls the size of these cutouts: 1 is generally a good default, and smaller values create bigger cutouts. Generally, more cutouts = nicer images. If we set reencode_each_frame: False, we can sort of “accumulate” cutout information in the VQGAN latent, which gets carried from frame to frame rather than being re-initialized each frame. Sometimes this will be helpful, sometimes it won’t.
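
As a rough illustration of how cut_pow shapes cutout sizes, here is a simplified sketch in the spirit of VQGAN+CLIP-style cutout code (this is not PyTTI’s exact implementation, and the min_size floor is an arbitrary choice for the example):

import numpy as np

def sample_cutout_size(side_len, cut_pow=1.0, min_size=32, rng=np.random.default_rng(0)):
    # A uniform sample raised to cut_pow sets the cutout's side length:
    # larger cut_pow skews toward smaller cutouts, smaller cut_pow toward larger ones.
    return int(rng.random() ** cut_pow * (side_len - min_size) + min_size)

print([sample_cutout_size(512, cut_pow=1.0) for _ in range(5)])   # roughly uniform sizes
print([sample_cutout_size(512, cut_pow=4.0) for _ in range(5)])   # mostly small cutouts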

seed: 12345

If a seed is not specified, one will be generated randomly. This value is used to initialize the random number generator: specifying a seed promotes deterministic (repeatable) behavior. This is an especially useful parameter to set for comparison studies like this.
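
A minimal illustration of the general principle (plain Python, not PyTTI code): seeding a random number generator makes its output repeatable.

import random

random.seed(12345)
first_run = [random.random() for _ in range(3)]

random.seed(12345)
second_run = [random.random() for _ in range(3)]

print(first_run == second_run)   # True: same seed, same "random" numbers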