Making a Music Video Using Deep Learning - Neural Style Transfer and Stable Diffusion

“I Am Tracking You”, our new EDM track, is almost ready. It’s about the intrusive nature of spyware and a jilted lover seeking revenge on her ex-boyfriend.

For the music video, I want to experiment with neural style transfer. A Cyberpunk 2077 look would suit the lyrics and the story we are trying to tell. The plan is to shoot video and then postprocess it so that it looks like it lives in the world of Cyberpunk 2077, or perhaps Blade Runner. Before actually shooting, I need to experiment with different options to get a feel for what’s possible.

The two main approaches that come to mind are neural style transfer and generating transformed images with Stable Diffusion.

Here is an experiment with neural style transfer comparing ReReVST vs Magenta.

As you can see, ReReVST works very well, but the Magenta model fails miserably. I checked the VGG19 predictions for the image, and the model does not seem to recognize any of the concepts in it, which would explain why the style transfer is not really respecting the content of the image.

The next example uses img2img from Stable Diffusion 1.4 to completely repaint an image. The repainting is then propagated to adjacent frames by analyzing the motion in the picture. This video sequence is fairly stable as it’s just a zoom shot, but you can see that some elements, like the animation on the computer screens and in the background, are not transferring well.
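To make the propagation step a bit more concrete, here is a minimal sketch of how a repainted key frame could be carried to a neighbouring frame with a dense motion field. This is not the code used for the video. It assumes NumPy and a hypothetical `flow` array that some optical flow estimator has already produced, and `warp_with_flow` is a name made up for illustration.

```python
import numpy as np

def warp_with_flow(keyframe, flow):
    """Backward-warp a repainted key frame into a neighbouring frame.

    keyframe: (H, W, C) array, the stylized key frame.
    flow:     (H, W, 2) array (assumed input). flow[y, x] = (dx, dy) means
              pixel (x, y) of the target frame is taken from
              (x + dx, y + dy) in the key frame.
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbour lookup, clamped to the image border.
    src_x = np.clip(np.rint(xs + flow[..., 0]), 0, w - 1).astype(int)
    src_y = np.clip(np.rint(ys + flow[..., 1]), 0, h - 1).astype(int)
    return keyframe[src_y, src_x]

# Tiny usage example: every target pixel looks one pixel to the right.
key = np.arange(12).reshape(3, 4, 1)
flow = np.zeros((3, 4, 2))
flow[..., 0] = 1.0
warped = warp_with_flow(key, flow)
```

A warp like this can only move pixels that already exist in the key frame. Content that changes on its own, such as an animated screen, has no matching pixels to pull from, which fits what shows up in the clip.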

The last example uses a model that was fine-tuned on the cyberpunk anime genre. It’s more challenging as there are pose changes and a fair bit of movement. As you can see, img2img is not working as well here, but I also used only 3 key frames for a 64-frame sequence. It’s still interesting.
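To get a feel for how sparse 3 key frames are, here is a small sketch that spreads key frames evenly over a sequence and measures how far each frame is from its nearest key frame. Even spacing is an assumption made for illustration, not necessarily how the key frames in the clip were chosen, and both function names are hypothetical.

```python
def keyframe_indices(num_frames, num_keys):
    """Evenly spaced key frame indices that include the first and last frame."""
    if num_keys < 2:
        return [0]
    last = num_frames - 1
    step = num_keys - 1
    # Integer arithmetic with round-half-up, so results do not depend on
    # floating point rounding.
    return [(i * last + step // 2) // step for i in range(num_keys)]

def nearest_keyframe(frame, keys):
    """Index of the key frame closest to the given frame."""
    return min(keys, key=lambda k: abs(k - frame))

keys = keyframe_indices(64, 3)
max_gap = max(abs(f - nearest_keyframe(f, keys)) for f in range(64))
```

Under this assumption the key frames land on frames 0, 32 and 63, and some frames sit 16 frames away from the nearest repainted image. That is a lot of motion to carry over when poses change, so adding key frames is the obvious next thing to try.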