Temporal Weighting... Or whatever it's called #473
Replies: 21 comments 129 replies
-
As far as I know, the weight in ComfyUI is the literal weight applied to the text, while A1111 normalizes the weights across the prompt so their overall strength stays around 1.0. I understand that this is an intentional decision. If weight normalization is needed, it may be worth creating a new node that performs normalization, or adding a normalization option to the existing node.
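(For readers skimming the thread: a minimal sketch of the two conventions as they are described here, assuming a list of (text, weight) pairs parsed out of the prompt. Neither snippet is the actual code of either UI; A1111 in fact renormalizes the CLIP embedding's mean after weighting, which is sketched further down the thread. This is only the weight-level intuition.)

```python
# Simplified illustration of the two conventions discussed in this thread (not actual UI code).

def apply_weights_literal(pairs):
    """ComfyUI-style: the number written in the prompt is the multiplier that gets used."""
    return [(text, weight) for text, weight in pairs]

def apply_weights_normalized(pairs):
    """A1111-like: rescale so the mean weight is 1.0, keeping only the relative emphasis."""
    mean = sum(w for _, w in pairs) / len(pairs)
    return [(text, w / mean) for text, w in pairs]

pairs = [("an ", 1.0), ("important", 1.1), (" word", 1.0)]
print(apply_weights_literal(pairs))      # weights used exactly as written
print(apply_weights_normalized(pairs))   # weights rescaled around 1.0
```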
-
It's intentional: https://comfyanonymous.github.io/ComfyUI_examples/faq/

I think this way is much better, and I don't like the way a1111 does it because it modifies the whole prompt (using incorrect math) as soon as you add a single weight.
-
I should have picked a more complex prompt, because that's where the differences become obvious. If you want it to behave that way you simply have to average the weights of the tokens. I have no plans on implementing this, simply because I think it fits better with ComfyUI to have the weights actually match the ones people use in the prompt, and I prefer it this way. Lora strengths don't get averaged out, unCLIP strengths don't get averaged out, Controlnet strengths don't get averaged out, so averaging out prompt strengths wouldn't fit at all.
-
I'm going to have to agree with comfy here. The current implementation is very explicit and clear. It's also not very hard to get a custom node going that does auto-normalizing. One thing that might make this slightly ugly is that the current codebase does the parsing/tokenizing and encoding in a single step, so you have to replicate a bit of the backend in order to do this. It's something my cutoff implementation also runs into, but it's overall quite minimal.
-
I don't understand how you can say things are just fine when you can't do crap with weighting. You can't use more than one embed if they have high vector counts, as those take tokens, and you can't underweight an embed to lessen its effect over others. You can't even control a normal prompt by making a bottle, or anything else, more emphasized than the rest of the prompt, making it essentially useless, even before the fact that the weighting is causing bad image results. You're really fine with no one being able to use ComfyUI like everyone is used to, because you have a personal opinion? Why not make that your personal build's functionality? Really, I was wondering why ComfyUI wasn't as popular considering its power, but if it produces bad images, you really can't get past that. Now that I can do what I want for professional workflows with a huge suite for post production, I find ComfyUI is useless for any of that. I have been doing digital art for over 25 years and there is no denying ComfyUI is producing terrible results with any sort of weighting or strong embeds.
-
I don't see a decisive disadvantage to either formula. As I see it, both formulas have reasons to be used.
-
I took a stab at a custom prompt node that runs the prompt through this first.

prompt_a1111.py:

```python
import re

re_attention = re.compile(r"""
\\\(|
\\\)|
\\\[|
\\]|
\\\\|
\\|
\(|
\[|
:([+-]?[.\d]+)\)|
\)|
]|
[^\\()\[\]:]+|
:
""", re.X)

re_break = re.compile(r"\s*\bBREAK\b\s*", re.S)


def parse_prompt_attention(text):
    """
    Parses a string with attention tokens and returns a list of pairs: text and its associated weight.
    Accepted tokens are:
      (abc) - increases attention to abc by a multiplier of 1.1
      (abc:3.12) - increases attention to abc by a multiplier of 3.12
      [abc] - decreases attention to abc by a multiplier of 1.1
      \( - literal character '('
      \[ - literal character '['
      \) - literal character ')'
      \] - literal character ']'
      \\ - literal character '\'
      anything else - just text

    >>> parse_prompt_attention('normal text')
    [['normal text', 1.0]]
    >>> parse_prompt_attention('an (important) word')
    [['an ', 1.0], ['important', 1.1], [' word', 1.0]]
    >>> parse_prompt_attention('(unbalanced')
    [['unbalanced', 1.1]]
    >>> parse_prompt_attention('\(literal\]')
    [['(literal]', 1.0]]
    >>> parse_prompt_attention('(unnecessary)(parens)')
    [['unnecessaryparens', 1.1]]
    >>> parse_prompt_attention('a (((house:1.3)) [on] a (hill:0.5), sun, (((sky))).')
    [['a ', 1.0],
     ['house', 1.5730000000000004],
     [' ', 1.1],
     ['on', 1.0],
     [' a ', 1.1],
     ['hill', 0.55],
     [', sun, ', 1.1],
     ['sky', 1.4641000000000006],
     ['.', 1.1]]
    """
    res = []
    round_brackets = []
    square_brackets = []

    round_bracket_multiplier = 1.1
    square_bracket_multiplier = 1 / 1.1

    def multiply_range(start_position, multiplier):
        for p in range(start_position, len(res)):
            res[p][1] *= multiplier

    for m in re_attention.finditer(text):
        text = m.group(0)
        weight = m.group(1)

        if text.startswith('\\'):
            res.append([text[1:], 1.0])
        elif text == '(':
            round_brackets.append(len(res))
        elif text == '[':
            square_brackets.append(len(res))
        elif weight is not None and len(round_brackets) > 0:
            multiply_range(round_brackets.pop(), float(weight))
        elif text == ')' and len(round_brackets) > 0:
            multiply_range(round_brackets.pop(), round_bracket_multiplier)
        elif text == ']' and len(square_brackets) > 0:
            multiply_range(square_brackets.pop(), square_bracket_multiplier)
        else:
            parts = re.split(re_break, text)
            for i, part in enumerate(parts):
                if i > 0:
                    res.append(["BREAK", -1])
                res.append([part, 1.0])

    for pos in round_brackets:
        multiply_range(pos, round_bracket_multiplier)

    for pos in square_brackets:
        multiply_range(pos, square_bracket_multiplier)

    if len(res) == 0:
        res = [["", 1.0]]

    # merge runs of identical weights
    i = 0
    while i + 1 < len(res):
        if res[i][1] == res[i + 1][1]:
            res[i][0] += res[i + 1][0]
            res.pop(i + 1)
        else:
            i += 1

    return ", ".join([f"({i[0]}:{i[1]})" for i in res])


class CLIPTextEncodeA1111:
    @classmethod
    def INPUT_TYPES(s):
        return {"required": {"text": ("STRING", {"multiline": True}), "clip": ("CLIP", )}}

    RETURN_TYPES = ("CONDITIONING",)
    FUNCTION = "encode"
    CATEGORY = "conditioning"

    def encode(self, clip, text):
        text = parse_prompt_attention(text)
        print(text)
        return ([[clip.encode(text), {}]], )


NODE_CLASS_MAPPINGS = {
    "CLIPTextEncodeA1111": CLIPTextEncodeA1111
}

# A dictionary that contains the friendly/humanly readable titles for the nodes
NODE_DISPLAY_NAME_MAPPINGS = {
    "CLIPTextEncodeA1111": "CLIP Text Encode (Auto1111)"
}
```
-
I tested the tags a bit on auto1111 and comfy with the same inputs.

[ComfyUI vs. A1111 comparison screenshots: inputs, VAE]
-
I'd just like to point out that what you mention about A1111's syntax in the OP doesn't make sense, and you may have a misunderstanding of the syntax. Either that, or the way you described it just doesn't make sense to me.
Do you mean only applying it for half of the generation time? Because that syntax is using prompt editing for the brackets: https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#prompt-editing
Again, this is using the prompt editing syntax. You are first applying emphasis to [...]. I'm only sharing this because, from the discussion title, I was expecting it to be about this feature, but instead got a lot of interesting discussion on weighting in general, with none of it related to being temporal. @BlenderNeko actually answered the initial question, but I assume it just got glossed over once it was mentioned that weighting is implemented differently.
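For readers who haven't used it, my reading of that wiki page is that the bracket-with-a-number syntax is temporal (prompt editing), not a weight; the step boundary is the given fraction of the total steps:

```
[flower:butterfly:0.5]  -> render "flower" for the first half of the steps, then switch to "butterfly"
[butterfly:0.5]         -> "butterfly" is only added to the prompt after half the steps
[flower::0.5]           -> "flower" is removed from the prompt after half the steps
```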
-
As for me, there's at least one major reason to have a node with "a1111 calculation" in CUI: the ability to share prompts between the community, my local CUI, and a1111.
-
I believe that while the a1111 approach has advantages, its effectiveness is unclear; it works like a kind of secret sauce. Although the outcomes can sometimes be favorable, there is a risk of unintended consequences, which implies we may not have complete control over the prompt. As a result, rather than simply copying a1111 when improving prompt handling, we should aim for ways to explicitly control the effects of the prompt, so that users can understand and manage them. Even if the results turn out to be wrong, users should be able to recognize which part of the prompt caused the error.
-
I am fairly certain now that the changes in behavior are not due to the presence or absence of this re-normalization step. Taking the prompt from @LEv145 with cyborg at 1.5, normalizing according to the code snippet I posted earlier results in very minimal changes to both the CLIP embedding and the output. I've taken the liberty of saving the embeddings A1111 creates before and after this normalization step and using them in ComfyUI. Again, the changes before and after normalization are minimal at best, and the results from the unnormalized embeddings taken from A1111 and used in comfy look just like the normalized embeddings look in A1111. Thus the difference lies somewhere in how the CLIP embeddings are created or how the weights are applied, but not in this normalization step. I will try to investigate this further if/when I have time.
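For context, the normalization step being discussed is, as far as I can tell, the rescale-to-original-mean that A1111 applies after multiplying each token embedding by its weight. A minimal sketch of that step (not the verbatim A1111 code; the tensor shapes are assumptions):

```python
import torch

def a1111_style_reweight(z, weights):
    # z: [batch, tokens, dim] CLIP hidden states; weights: [batch, tokens] per-token prompt weights
    original_mean = z.mean()
    z = z * weights.unsqueeze(-1)       # scale each token embedding by its prompt weight
    z = z * original_mean / z.mean()    # restore the original mean of the whole embedding
    return z
```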
-
Alright, I made an AdvancedClipTextEncode node that allows for a bunch more options (excuse the current mess that is the readme, I'll work on that some more later). Right now this node relies on some changes made in the fork you can find here. I have not done extensive testing on all of this, but maybe somebody in here is interested in doing that.
-
I'll just say, even with all the banter, it seems you guys are going for amazing results. If it becomes possible to choose whichever attention/token handling approach one wants, it just expands possibilities, and likely understanding too. Huge thanks for the work!
-
Going to jump in on this: WASasquatch is 100% right. In ALL of these images there is SIGNIFICANT burning caused when any weight is applied to any part of the prompt. The mere fact that this is the case makes ComfyUI an almost non-starter for me. The fact that @comfyanonymous is so stubborn about this doesn't help matters either. There are objective, qualitative deficiencies in the way ComfyUI does weighting, such that it is nearly impossible to get anything that even remotely has the same level of quality as A1111 when using weights; the difference in results should not be this huge between the same model. All of the examples that @comfyanonymous has provided just seem to reinforce @WASasquatch's point about the weighting system being broken. In his weighted angry pictures, you can clearly see dark/black line artifacting and burning all across the image. It downright looks worse the higher the weight is, which is different from how the same prompt on the same model with the same weight looks in A1111; you just don't experience that kind of burning and artifacting there. @comfyanonymous Seriously, add the other method of weighting as an alternative if you don't want to get rid of the way you currently implement it, but for the love of god, if you can't see how bad the current method of weighting is, you are blind. Adding to this, Comfy++ seems to be a good middle ground overall, but even still, the A1111 method of weighting should remain available to users in order to replicate images that were generated on A1111.
-
I just want to say, without understanding much of the technicalities: I'm getting some very nice results with comfy, some that I think no one even thought possible (I will do a reddit write-up in some days with more experiments). Not sure it's solely down to ComfyUI, but I just wanted to say I completely do not buy that comfy's prompt handling is inferior. It's different; that does not mean it's worse. It's like people railing on, say, the Blender interface: it never was that bad, they were just used to other stuff. So now one just needs to get used to the peculiarities of another tool. (Well, not really, as BlenderNeko seems to be working on some awesome stuff that will allow us to have both, and more.) But saying that comfy's prompt handling is shit is not overly reasonable, albeit understandable if you want the other handling implemented. I'd be quite against making the current handling inaccessible/obsolete, though.
-
...
I'm just going to leave things here; it's abundantly clear that this discussion is no longer useful to have. Good luck with your fork.
-
If I got it correct, the core question (of the arguments above, not the OP's question) is whether to normalize the weighted sequence to the original mean. Actually, there's another approach, which I think is better than both: we shouldn't directly mess with the embedding, but instead use attention masks for the cross attentions.
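One way to read "attention masks for cross attentions" is to bias how much attention the UNet pays to each prompt token, instead of scaling the embeddings themselves. A rough, hypothetical sketch of that idea (the function, the shapes, and the token_weights argument are assumptions for illustration, not anyone's actual implementation):

```python
import torch

def reweighted_cross_attention(q, k, v, token_weights):
    # q: [B, H, Q, D]; k, v: [B, H, T, D]; token_weights: [T], 1.0 = neutral emphasis
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(torch.einsum('bhqd,bhtd->bhqt', q, k) * scale, dim=-1)
    attn = attn * token_weights.view(1, 1, 1, -1)   # emphasize/de-emphasize prompt tokens
    attn = attn / attn.sum(dim=-1, keepdim=True)    # renormalize so each row sums to 1 again
    return torch.einsum('bhqt,bhtd->bhqd', attn, v)
```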
-
This exchange was pretty weird. I think I read everything, and I have a couple of questions of my own that I don't think got addressed. I'm interested in the final result at the end of the day, because I'm pretty tired of the lack of automation in the a1111 interface, and I also find it to be a bit slow compared to comfy. I want my GPUs spending more time crunching and less time waiting for the CPU to crank through inefficient code. The longer the GPU waits on the CPU for the next unit of work, the deeper the thermal cycles are during extended operation, and the more physical wear is endured by the solder holding the chips to the substrate. Ideally, generation batches can be pipelined well enough (VRAM permitting) to let work fully saturate the GPU, keeping a constant level of utilization. But I digress. As for output quality, obviously that matters too. So it's worth trying to figure out what the real problems are and separate the wheat (what tools and workflows we should use to better achieve a given goal) from the chaff (arguments that cannot be supported by evidence).

I just checked, and the controlnet 1.1 release was April 13. It really shook things up because of how big a jump in capability the 1.1 release provided. It may be that the optimal approach for whatever professional workflows @WASasquatch has been hinting at changed around that time. But I've seen many posts people have made using various tools to remix videos, using controlnet 1.1 and other state-of-the-art techniques no doubt, and there always remain lots of shimmering and swimming of details, and the very unnatural phenomenon of details materializing and disappearing; it has a very hyperdimensional alien effect. Sometimes people are able to use various techniques to blend/interpolate frames enough to make hair movements look halfway okay, but even then, bones are still swimming underneath the skin... Trust that there will be much more robust solutions for the problems in the temporal domain coming down the pipe. Bending over backwards trying to force tech not designed for the purpose into fulfilling it will be an exercise in frustration. Better to focus on something else while it's being solved, and once it is, there might not even be much of a financial reason to make movies anymore at that point (this is a joke).

Trying to say that something is wrong or bad just because it is worse for a niche use case is not a reasonable claim, especially if you do not even make an effort to demonstrate in what way and to what extent it is worse for that use case. We should have the best of both worlds in comfy now that we have the new node contributed by @BlenderNeko. It would be nice to get confirmation from those with experience that the generation behavior of a1111 can now be replicated in Comfy by using this node. If the RNG can be made to line up, we should be able to get the same outputs and can do pixel comparisons. If nobody here is willing to do that, I might take a stab at it to satisfy my own curiosity, so I know I can comfortably move forward with Comfy for all of my needs.
-
In A1111, you can do things like

[((theEmbed)):0.5]

for a strong effect only applied at half weight, which balances out the strong weighting of the embed (or whatever you are doing text-wise). You could even do

[(theEmbed):1.5]

for a strong effect that overpowers other embeds a bit so they balance out better (like subject vs style). But in ComfyUI, even one level of weighting causes the embedding to blow out the image (hard color burns, hard contrast, weird chromatic aberration effect). It doesn't seem the weighting is as versatile in ComfyUI, and maybe it could use improvement? A lot of my prompts simply don't work even beyond converting embeds, because the weighting looks really bad in ComfyUI.