Temporal Weighting... Or whatever it's called #473
Replies: 21 comments 129 replies
-
As far as I know, the weight in ComfyUI is the literal weight applied to the text, while A1111 normalizes the weights across the prompt so their overall strength stays around 1.0. I understand that this is an intentional decision. If weight normalization is needed, it may be worth creating a new node that performs normalization, or adding a normalization option to the existing node.
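(For readers skimming the thread: a minimal sketch of the two conventions as they are described here, assuming a list of (text, weight) pairs parsed out of the prompt. Neither snippet is the actual code of either UI; A1111 in fact renormalizes the CLIP embedding's mean after weighting, which is sketched further down the thread. This is only the weight-level intuition.)

```python
# Simplified illustration of the two conventions discussed in this thread (not actual UI code).

def apply_weights_literal(pairs):
    """ComfyUI-style: the number written in the prompt is the multiplier that gets used."""
    return [(text, weight) for text, weight in pairs]

def apply_weights_normalized(pairs):
    """A1111-like: rescale so the mean weight is 1.0, keeping only the relative emphasis."""
    mean = sum(w for _, w in pairs) / len(pairs)
    return [(text, w / mean) for text, w in pairs]

pairs = [("an ", 1.0), ("important", 1.1), (" word", 1.0)]
print(apply_weights_literal(pairs))      # weights used exactly as written
print(apply_weights_normalized(pairs))   # weights rescaled around 1.0
```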
-
It's intentional: https://comfyanonymous.github.io/ComfyUI_examples/faq/

I think this way is much better, and I don't like the way a1111 does it because it modifies the whole prompt (using incorrect math) as soon as you add a single weight.
-
I should have picked a more complex prompt, because that's where the differences become obvious. If you want it to behave that way you simply have to average the weights of the tokens. I have no plans on implementing this, simply because I think it fits better with ComfyUI to have the weights actually match the ones people use in the prompt, and I prefer it this way. Lora strengths don't get averaged out, unCLIP strengths don't get averaged out, Controlnet strengths don't get averaged out, so averaging out prompt strengths wouldn't fit at all.
-
I'm going to have to agree with comfy here. The current implementation is very explicit and clear. It's also not very hard to get a custom node going that does auto-normalizing. One thing that might make this slightly ugly is that the current codebase does the parsing/tokenizing and encoding in a single step, so you have to replicate a bit of the backend in order to do this. It's something my cutoff implementation also runs into, but it's overall quite minimal.
-
I don't understand how you can say things are just fine when you can't do crap with weighting. You can't use more than one embed if they have high vector counts, as those take tokens, and you can't underweight an embed to lessen its effect over others. You can't even control a normal prompt by making a bottle, or anything else, more emphasized than the rest of the prompt, making it essentially useless, even before the fact that the weighting is causing bad image results. You're really fine with no one being able to use ComfyUI like everyone is used to, because you have a personal opinion? Why not make that your personal build's functionality? Really, I was wondering why ComfyUI wasn't as popular considering its power, but if it produces bad images, you really can't get past that. Now that I can do what I want for professional workflows with a huge suite for post production, I find ComfyUI is useless for any of that. I have been doing digital art for over 25 years and there is no denying ComfyUI is producing terrible results with any sort of weighting or strong embeds.
-
I don't see a decisive disadvantage to either formula. As I see it, both formulas have reasons to be used.
-
I took a stab at a custom prompt node that runs the prompt through this first.

prompt_a1111.py:

```python
import re

re_attention = re.compile(r"""
\\\(|
\\\)|
\\\[|
\\]|
\\\\|
\\|
\(|
\[|
:([+-]?[.\d]+)\)|
\)|
]|
[^\\()\[\]:]+|
:
""", re.X)

re_break = re.compile(r"\s*\bBREAK\b\s*", re.S)


def parse_prompt_attention(text):
    """
    Parses a string with attention tokens and returns a list of pairs: text and its associated weight.
    Accepted tokens are:
      (abc) - increases attention to abc by a multiplier of 1.1
      (abc:3.12) - increases attention to abc by a multiplier of 3.12
      [abc] - decreases attention to abc by a multiplier of 1.1
      \( - literal character '('
      \[ - literal character '['
      \) - literal character ')'
      \] - literal character ']'
      \\ - literal character '\'
      anything else - just text

    >>> parse_prompt_attention('normal text')
    [['normal text', 1.0]]
    >>> parse_prompt_attention('an (important) word')
    [['an ', 1.0], ['important', 1.1], [' word', 1.0]]
    >>> parse_prompt_attention('(unbalanced')
    [['unbalanced', 1.1]]
    >>> parse_prompt_attention('\(literal\]')
    [['(literal]', 1.0]]
    >>> parse_prompt_attention('(unnecessary)(parens)')
    [['unnecessaryparens', 1.1]]
    >>> parse_prompt_attention('a (((house:1.3)) [on] a (hill:0.5), sun, (((sky))).')
    [['a ', 1.0],
     ['house', 1.5730000000000004],
     [' ', 1.1],
     ['on', 1.0],
     [' a ', 1.1],
     ['hill', 0.55],
     [', sun, ', 1.1],
     ['sky', 1.4641000000000006],
     ['.', 1.1]]
    """
    res = []
    round_brackets = []
    square_brackets = []

    round_bracket_multiplier = 1.1
    square_bracket_multiplier = 1 / 1.1

    def multiply_range(start_position, multiplier):
        for p in range(start_position, len(res)):
            res[p][1] *= multiplier

    for m in re_attention.finditer(text):
        text = m.group(0)
        weight = m.group(1)

        if text.startswith('\\'):
            res.append([text[1:], 1.0])
        elif text == '(':
            round_brackets.append(len(res))
        elif text == '[':
            square_brackets.append(len(res))
        elif weight is not None and len(round_brackets) > 0:
            multiply_range(round_brackets.pop(), float(weight))
        elif text == ')' and len(round_brackets) > 0:
            multiply_range(round_brackets.pop(), round_bracket_multiplier)
        elif text == ']' and len(square_brackets) > 0:
            multiply_range(square_brackets.pop(), square_bracket_multiplier)
        else:
            parts = re.split(re_break, text)
            for i, part in enumerate(parts):
                if i > 0:
                    res.append(["BREAK", -1])
                res.append([part, 1.0])

    for pos in round_brackets:
        multiply_range(pos, round_bracket_multiplier)

    for pos in square_brackets:
        multiply_range(pos, square_bracket_multiplier)

    if len(res) == 0:
        res = [["", 1.0]]

    # merge runs of identical weights
    i = 0
    while i + 1 < len(res):
        if res[i][1] == res[i + 1][1]:
            res[i][0] += res[i + 1][0]
            res.pop(i + 1)
        else:
            i += 1

    return ", ".join([f"({i[0]}:{i[1]})" for i in res])


class CLIPTextEncodeA1111:
    @classmethod
    def INPUT_TYPES(s):
        return {"required": {"text": ("STRING", {"multiline": True}), "clip": ("CLIP", )}}

    RETURN_TYPES = ("CONDITIONING",)
    FUNCTION = "encode"
    CATEGORY = "conditioning"

    def encode(self, clip, text):
        text = parse_prompt_attention(text)
        print(text)
        return ([[clip.encode(text), {}]], )


NODE_CLASS_MAPPINGS = {
    "CLIPTextEncodeA1111": CLIPTextEncodeA1111
}

# A dictionary that contains the friendly/humanly readable titles for the nodes
NODE_DISPLAY_NAME_MAPPINGS = {
    "CLIPTextEncodeA1111": "CLIP Text Encode (Auto1111)"
}
```
-
I tested the tags a bit on auto1111 and comfy with the same inputs.

[ComfyUI vs. A1111 comparison screenshots: inputs, VAE]
-
I'd just like to point out that what you mention about A1111's syntax in the OP doesn't make sense, and you may have a misunderstanding of the syntax. Either that, or the way you described it just doesn't make sense to me.
Do you mean only applying it for half of the generation time? Because that syntax is using prompt editing for the brackets: https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#prompt-editing
Again, this is using the prompt editing syntax. You are first applying emphasis to [...]. I'm only sharing this because, from the discussion title, I was expecting it to be about this feature, but instead got a lot of interesting discussion on weighting in general, with none of it related to being temporal. @BlenderNeko actually answered the initial question, but I assume it just got glossed over once it was mentioned that weighting is implemented differently.
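For readers who haven't used it, my reading of that wiki page is that the bracket-with-a-number syntax is temporal (prompt editing), not a weight; the step boundary is the given fraction of the total steps:

```
[flower:butterfly:0.5]  -> render "flower" for the first half of the steps, then switch to "butterfly"
[butterfly:0.5]         -> "butterfly" is only added to the prompt after half the steps
[flower::0.5]           -> "flower" is removed from the prompt after half the steps
```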
-
As for me, there's at least one major reason to have a node with "a1111 calculation" in CUI: the ability to share prompts between the community, my local CUI, and a1111.
-
I believe that while the a1111 approach has advantages, its effectiveness is unclear; it works like a kind of secret sauce. Although the outcomes can sometimes be favorable, there is a risk of unintended consequences, which implies we may not have complete control over the prompt. As a result, rather than simply copying a1111 when improving prompt handling, we should aim for ways to explicitly control the effects of the prompt, so that users can understand and manage them. Even if the results turn out to be wrong, users should be able to recognize which part of the prompt caused the error.
-
I am fairly certain now that the changes in behavior are not due to the presence or absence of this re-normalization step. Taking the prompt from @LEv145 with cyborg at 1.5, normalizing according to the code snippet I posted earlier results in very minimal changes to both the CLIP embedding and the output. I've taken the liberty of saving the embeddings A1111 creates before and after this normalization step and using them in ComfyUI. Again, the changes before and after normalization are minimal at best, and the results from the unnormalized embeddings taken from A1111 and used in comfy look just like the normalized embeddings look in A1111. Thus the difference lies somewhere in how the CLIP embeddings are created or how the weights are applied, but not in this normalization step. I will try to investigate this further if/when I have time.
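For context, the normalization step being discussed is, as far as I can tell, the rescale-to-original-mean that A1111 applies after multiplying each token embedding by its weight. A minimal sketch of that step (not the verbatim A1111 code; the tensor shapes are assumptions):

```python
import torch

def a1111_style_reweight(z, weights):
    # z: [batch, tokens, dim] CLIP hidden states; weights: [batch, tokens] per-token prompt weights
    original_mean = z.mean()
    z = z * weights.unsqueeze(-1)       # scale each token embedding by its prompt weight
    z = z * original_mean / z.mean()    # restore the original mean of the whole embedding
    return z
```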
-
Alright, I made an AdvancedClipTextEncode node that allows for a bunch more options (excuse the current mess that is the readme, I'll work on that some more later). Right now this node relies on some changes made in the fork you can find here. I have not done extensive testing on all of this, but maybe somebody in here is interested in doing that.
-
I'll just say, even with all the banter, it seems you guys are going for amazing results. If it becomes possible to choose whichever attention/token handling approach one wants, it just expands possibilities, and likely understanding too. Huge thanks for the work!
-
Going to jump in on this: WASasquatch is 100% right. In ALL of these images there is SIGNIFICANT burning caused when any weight is applied to any part of the prompt. The mere fact that this is the case makes ComfyUI an almost non-starter for me. The fact that @comfyanonymous is so stubborn about this doesn't help matters either. There are objective, qualitative deficiencies in the way ComfyUI does weighting, such that it is nearly impossible to get anything that even remotely has the same level of quality as A1111 when using weights; the difference in results should not be this huge between the same model. All of the examples that @comfyanonymous has provided just seem to reinforce @WASasquatch's point about the weighting system being broken. In his weighted angry pictures, you can clearly see dark/black line artifacting and burning all across the image. It downright looks worse the higher the weight is, which is different from how the same prompt on the same model with the same weight looks in A1111; you just don't experience that kind of burning and artifacting there. @comfyanonymous Seriously, add the other method of weighting as an alternative if you don't want to get rid of the way you currently implement it, but for the love of god, if you can't see how bad the current method of weighting is, you are blind. Adding to this, Comfy++ seems to be a good middle ground overall, but even still, the A1111 method of weighting should remain available to users in order to replicate images that were generated on A1111.
-
I just want to say, without understanding much of the technicalities: I'm getting some very nice results with comfy, some that I think no one even thought possible (I will do a reddit write-up in some days with more experiments). Not sure it's solely down to ComfyUI, but I just wanted to say I completely do not buy that comfy's prompt handling is inferior. It's different; that does not mean it's worse. It's like people railing on, say, the Blender interface: it never was that bad, they were just used to other stuff. So now one just needs to get used to the peculiarities of another tool. (Well, not really, as BlenderNeko seems to be working on some awesome stuff that will allow us to have both, and more.) But saying that comfy's prompt handling is shit is not overly reasonable, albeit understandable if you want the other handling implemented. I'd be quite against making the current handling inaccessible/obsolete, though.
-
...
I'm just going to leave things here; it's abundantly clear that this discussion is no longer useful to have. Good luck with your fork.
-
If I got it correct, the core question (of the arguments above, not the OP's question) is whether to normalize the weighted sequence to the original mean. Actually, there's another approach, which I think is better than both: we shouldn't directly mess with the embedding, but instead use attention masks for the cross attentions.
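One way to read "attention masks for cross attentions" is to bias how much attention the UNet pays to each prompt token, instead of scaling the embeddings themselves. A rough, hypothetical sketch of that idea (the function, the shapes, and the token_weights argument are assumptions for illustration, not anyone's actual implementation):

```python
import torch

def reweighted_cross_attention(q, k, v, token_weights):
    # q: [B, H, Q, D]; k, v: [B, H, T, D]; token_weights: [T], 1.0 = neutral emphasis
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(torch.einsum('bhqd,bhtd->bhqt', q, k) * scale, dim=-1)
    attn = attn * token_weights.view(1, 1, 1, -1)   # emphasize/de-emphasize prompt tokens
    attn = attn / attn.sum(dim=-1, keepdim=True)    # renormalize so each row sums to 1 again
    return torch.einsum('bhqt,bhtd->bhqd', attn, v)
```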
-
This exchange was pretty weird. I think I read everything, and I have a couple of questions of my own that I don't think got addressed. I'm interested in the final result at the end of the day, because I'm pretty tired of the lack of automation in the a1111 interface, and I also find it to be a bit slow compared to comfy. I want my GPUs spending more time crunching and less time waiting for the CPU to crank through inefficient code. The longer the GPU waits on the CPU for the next unit of work, the deeper the thermal cycles are during extended operation, and the more physical wear is endured by the solder holding the chips to the substrate. Ideally, generation batches can be pipelined well enough (VRAM permitting) to let work fully saturate the GPU, keeping a constant level of utilization. But I digress. As for output quality, obviously that matters too. So it's worth trying to figure out what the real problems are and separate the wheat (what tools and workflows we should use to better achieve a given goal) from the chaff (arguments that cannot be supported by evidence).

I just checked, and the controlnet 1.1 release was April 13. It really shook things up because of how big a jump in capability the 1.1 release provided. It may be that the optimal approach for whatever professional workflows @WASasquatch has been hinting at changed around that time. But I've seen many posts people have made using various tools to remix videos, using controlnet 1.1 and other state-of-the-art techniques no doubt, and there always remain lots of shimmering and swimming of details, and the very unnatural phenomenon of details materializing and disappearing; it has a very hyperdimensional alien effect. Sometimes people are able to use various techniques to blend/interpolate frames enough to make hair movements look halfway okay, but even then, bones are still swimming underneath the skin... Trust that there will be much more robust solutions for the problems in the temporal domain coming down the pipe. Bending over backwards trying to force tech not designed for the purpose into fulfilling it will be an exercise in frustration. Better to focus on something else while it's being solved, and once it is, there might not even be much of a financial reason to make movies anymore at that point (this is a joke).

Trying to say that something is wrong or bad just because it is worse for a niche use case is not a reasonable claim, especially if you do not even make an effort to demonstrate in what way and to what extent it is worse for that use case. We should have the best of both worlds in comfy now that we have the new node contributed by @BlenderNeko. It would be nice to get confirmation from those with experience that the generation behavior of a1111 can now be replicated in Comfy by using this node. If the RNG can be made to line up, we should be able to get the same outputs and can do pixel comparisons. If nobody here is willing to do that, I might take a stab at it to satisfy my own curiosity, so I know I can comfortably move forward with Comfy for all of my needs.
-
In A1111, you can do things like

[((theEmbed)):0.5]

for a strong effect only applied at half weight, which balances out the strong weighting of the embed (or whatever you are doing text-wise). You could even do

[(theEmbed):1.5]

for a strong effect that overpowers other embeds a bit so they balance out better (like subject vs style). But in ComfyUI, even one level of weighting causes the embedding to blow out the image (hard color burns, hard contrast, weird chromatic aberration effect). It doesn't seem the weighting is as versatile in ComfyUI, and maybe it could use improvement? A lot of my prompts simply don't work even beyond converting embeds, because the weighting looks really bad in ComfyUI.