-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
why the output is blank #81
Comments
same issue |
dude, i think i might solve this problem. When I was implementing another project(https://github.com/IVGSZ/Flash-VStream?tab=readme-ov-file#structure). I first convert videos into video features(safesensors) using ViT. Then it gives me the correct answer. I guess things are same here. Becasue I didn't see code downloading ViT weights from openai in PLLaVA. I guess we need to upload safesensors files instead of videos. |
Thank you for your help! i fix this problem yesterday by use the "lora" and "weigth path" args. it works but i dont know why, hh. I will try your method recently.
At 2024-09-10 16:30:28, "Jincheng Zhou" ***@***.***> wrote:
dude, i think i might solve this problem. When I was implementing another project(https://github.com/IVGSZ/Flash-VStream?tab=readme-ov-file#structure). I first convert videos into video features(safesensors) using ViT. Then it gives me the correct answer. I guess things are same here. Becasue I didn't see code downloading ViT weights from openai in PLLaVA. I guess we need to upload safesensors files instead of videos.
—
Reply to this email directly, view it on GitHub, or unsubscribe.
You are receiving this because you commented.Message ID: ***@***.***>
|
I met this issue too,how did you solve it? |
I have build my own demo file. after uploading one video, it gives blank output. Could anyone help me out?
-------------------------------Here's the demo file-------------------------
from argparse import ArgumentParser
from tasks.eval.model_utils import load_pllava
from tasks.eval.eval_utils import ChatPllava, conv_plain_v1, conv_templates
SYSTEM = """You are a powerful Video Magic ChatBot, a large vision-language assistant.
You are able to understand the video content that the user provides and assist the user in a video-language related task.
The user might provide you with the video and maybe some extra noisy information to help you out or ask you a question. Make use of the information in a proper way to be competent for the job.
INSTRUCTIONS:
"""
INIT_CONVERSATION = conv_plain_v1.copy()
def init_model(args):
print('Initializing PLLaVA')
model, processor = load_pllava(
args.pretrained_model_name_or_path, args.num_frames,
use_lora=args.use_lora,
weight_dir=args.weight_dir,
lora_alpha=args.lora_alpha,
use_multi_gpus=args.use_multi_gpus)
if not args.use_multi_gpus:
model = model.to('cuda')
chat = ChatPllava(model, processor)
return chat
def run_inference(video_path, question, chat, num_beams=1, temperature=1.0):
# Upload the video
llm_message, img_list, chat_state = chat.upload_video(video_path, INIT_CONVERSATION.copy(), [])
def parse_args():
parser = ArgumentParser()
parser.add_argument("--pretrained_model_name_or_path", type=str, required=False, default='/home/PLLaVA/MODELS/pllava-13b')
parser.add_argument("--num_frames", type=int, required=False, default=4)
parser.add_argument("--use_lora", action='store_true')
parser.add_argument("--use_multi_gpus", action='store_true')
parser.add_argument("--lora_alpha", type=int, default=None, help="LoRA alpha parameter.")
parser.add_argument("--weight_dir", type=str, required=False, default=None)
parser.add_argument("--conv_mode", type=str, required=False, default='eval_vcgbench')
parser.add_argument("--video_path", type=str, required=False, default="/home/PLLaVA/EDATA/VCGBench/Test_Videos/v_-D1gdv_gQyw.mp4")
parser.add_argument("--question", type=str, required=False, default="describe the video")
parser.add_argument("--num_beams", type=int, default=1, help="Beam search numbers.")
parser.add_argument("--temperature", type=float, default=1.0, help="Temperature for text generation.")
args = parser.parse_args()
return args
if name == "main":
args = parse_args()
-------------------------------Here's the output-------------------------------
Initializing PLLaVA
Loading model from /home/PLLaVA/MODELS/pllava-13b
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████| 6/6 [00:20<00:00, 3.38s/it]
Some weights of PllavaForConditionalGeneration were not initialized from the model checkpoint at /home/PLLaVA/MODELS/pllava-13b and are newly initialized: ['language_model.lm_head.weight', 'language_model.model.embed_tokens.weight', 'language_model.model.layers.0.input_layernorm.weight', 'language_model.model.layers.0.mlp.down_proj.weight', 'language_model.model.layers.0.mlp.gate_proj.weight', 'language_model.model.layers.0.mlp.up_proj.weight', 'language_model.model.layers.0.post_attention_layernorm.weight', 'language_model.model.layers.0.self_attn.k_proj.weight', 'language_model.model.layers.0.self_attn.o_proj.weight', 'language_model.model.layers.0.self_attn.q_proj.weight', 'language_model.model.layers.0.self_attn.v_proj.weight', 'language_model.model.layers.1.input_layernorm.weight', 'language_model.model.layers.1.mlp.down_proj.weight', 'language_model.model.layers.1.mlp.gate_proj.weight', 'language_model.model.layers.1.mlp.up_proj.weight', 'language_model.model.layers.1.post_attention_layernorm.weight', 'language_model.model.layers.1.self_attn.k_proj.weight', 'language_model.model.layers.1.self_attn.o_proj.weight', 'language_model.model.layers.1.self_attn.q_proj.weight', 'language_model.model.layers.1.self_attn.v_proj.weight', 'language_model.model.layers.10.input_layernorm.weight', 'language_model.model.layers.10.mlp.down_proj.weight', 'language_model.model.layers.10.mlp.gate_proj.weight', 'language_model.model.layers.10.mlp.up_proj.weight', 'language_model.model.layers.10.post_attention_layernorm.weight', 'language_model.model.layers.10.self_attn.k_proj.weight', 'language_model.model.layers.10.self_attn.o_proj.weight', 'language_model.model.layers.10.self_attn.q_proj.weight', 'language_model.model.layers.10.self_attn.v_proj.weight', 'language_model.model.layers.11.input_layernorm.weight', 'language_model.model.layers.11.mlp.down_proj.weight', 'language_model.model.layers.11.mlp.gate_proj.weight', 'language_model.model.layers.11.mlp.up_proj.weight', 'language_model.model.layers.11.post_attention_layernorm.weight', 'language_model.model.layers.11.self_attn.k_proj.weight', 'language_model.model.layers.11.self_attn.o_proj.weight', 'language_model.model.layers.11.self_attn.q_proj.weight', 'language_model.model.layers.11.self_attn.v_proj.weight', 'language_model.model.layers.12.input_layernorm.weight', 'language_model.model.layers.12.mlp.down_proj.weight', 'language_model.model.layers.12.mlp.gate_proj.weight', 'language_model.model.layers.12.mlp.up_proj.weight', 'language_model.model.layers.12.post_attention_layernorm.weight', 'language_model.model.layers.12.self_attn.k_proj.weight', 'language_model.model.layers.12.self_attn.o_proj.weight', 'language_model.model.layers.12.self_attn.q_proj.weight', 'language_model.model.layers.12.self_attn.v_proj.weight', 'language_model.model.layers.13.input_layernorm.weight', 'language_model.model.layers.13.mlp.down_proj.weight', 'language_model.model.layers.13.mlp.gate_proj.weight', 'language_model.model.layers.13.mlp.up_proj.weight', 'language_model.model.layers.13.post_attention_layernorm.weight', 'language_model.model.layers.13.self_attn.k_proj.weight', 'language_model.model.layers.13.self_attn.o_proj.weight', 'language_model.model.layers.13.self_attn.q_proj.weight', 'language_model.model.layers.13.self_attn.v_proj.weight', 'language_model.model.layers.14.input_layernorm.weight', 'language_model.model.layers.14.mlp.down_proj.weight', 'language_model.model.layers.14.mlp.gate_proj.weight', 'language_model.model.layers.14.mlp.up_proj.weight', 'language_model.model.layers.14.post_attention_layernorm.weight', 'language_model.model.layers.14.self_attn.k_proj.weight', 'language_model.model.layers.14.self_attn.o_proj.weight', 'language_model.model.layers.14.self_attn.q_proj.weight', 'language_model.model.layers.14.self_attn.v_proj.weight', 'language_model.model.layers.15.input_layernorm.weight', 'language_model.model.layers.15.mlp.down_proj.weight', 'language_model.model.layers.15.mlp.gate_proj.weight', 'language_model.model.layers.15.mlp.up_proj.weight', 'language_model.model.layers.15.post_attention_layernorm.weight', 'language_model.model.layers.15.self_attn.k_proj.weight', 'language_model.model.layers.15.self_attn.o_proj.weight', 'language_model.model.layers.15.self_attn.q_proj.weight', 'language_model.model.layers.15.self_attn.v_proj.weight', 'language_model.model.layers.16.input_layernorm.weight', 'language_model.model.layers.16.mlp.down_proj.weight', 'language_model.model.layers.16.mlp.gate_proj.weight', 'language_model.model.layers.16.mlp.up_proj.weight', 'language_model.model.layers.16.post_attention_layernorm.weight', 'language_model.model.layers.16.self_attn.k_proj.weight', 'language_model.model.layers.16.self_attn.o_proj.weight', 'language_model.model.layers.16.self_attn.q_proj.weight', 'language_model.model.layers.16.self_attn.v_proj.weight', 'language_model.model.layers.17.input_layernorm.weight', 'language_model.model.layers.17.mlp.down_proj.weight', 'language_model.model.layers.17.mlp.gate_proj.weight', 'language_model.model.layers.17.mlp.up_proj.weight', 'language_model.model.layers.17.post_attention_layernorm.weight', 'language_model.model.layers.17.self_attn.k_proj.weight', 'language_model.model.layers.17.self_attn.o_proj.weight', 'language_model.model.layers.17.self_attn.q_proj.weight', 'language_model.model.layers.17.self_attn.v_proj.weight', 'language_model.model.layers.18.input_layernorm.weight', 'language_model.model.layers.18.mlp.down_proj.weight', 'language_model.model.layers.18.mlp.gate_proj.weight', 'language_model.model.layers.18.mlp.up_proj.weight', 'language_model.model.layers.18.post_attention_layernorm.weight', 'language_model.model.layers.18.self_attn.k_proj.weight', 'language_model.model.layers.18.self_attn.o_proj.weight', 'language_model.model.layers.18.self_attn.q_proj.weight', 'language_model.model.layers.18.self_attn.v_proj.weight', 'language_model.model.layers.19.input_layernorm.weight', 'language_model.model.layers.19.mlp.down_proj.weight', 'language_model.model.layers.19.mlp.gate_proj.weight', 'language_model.model.layers.19.mlp.up_proj.weight', 'language_model.model.layers.19.post_attention_layernorm.weight', 'language_model.model.layers.19.self_attn.k_proj.weight', 'language_model.model.layers.19.self_attn.o_proj.weight', 'language_model.model.layers.19.self_attn.q_proj.weight', 'language_model.model.layers.19.self_attn.v_proj.weight', 'language_model.model.layers.2.input_layernorm.weight', 'language_model.model.layers.2.mlp.down_proj.weight', 'language_model.model.layers.2.mlp.gate_proj.weight', 'language_model.model.layers.2.mlp.up_proj.weight', 'language_model.model.layers.2.post_attention_layernorm.weight', 'language_model.model.layers.2.self_attn.k_proj.weight', 'language_model.model.layers.2.self_attn.o_proj.weight', 'language_model.model.layers.2.self_attn.q_proj.weight', 'language_model.model.layers.2.self_attn.v_proj.weight', 'language_model.model.layers.20.input_layernorm.weight', 'language_model.model.layers.20.mlp.down_proj.weight', 'language_model.model.layers.20.mlp.gate_proj.weight', 'language_model.model.layers.20.mlp.up_proj.weight', 'language_model.model.layers.20.post_attention_layernorm.weight', 'language_model.model.layers.20.self_attn.k_proj.weight', 'language_model.model.layers.20.self_attn.o_proj.weight', 'language_model.model.layers.20.self_attn.q_proj.weight', 'language_model.model.layers.20.self_attn.v_proj.weight', 'language_model.model.layers.21.input_layernorm.weight', 'language_model.model.layers.21.mlp.down_proj.weight', 'language_model.model.layers.21.mlp.gate_proj.weight', 'language_model.model.layers.21.mlp.up_proj.weight', 'language_model.model.layers.21.post_attention_layernorm.weight', 'language_model.model.layers.21.self_attn.k_proj.weight', 'language_model.model.layers.21.self_attn.o_proj.weight', 'language_model.model.layers.21.self_attn.q_proj.weight', 'language_model.model.layers.21.self_attn.v_proj.weight', 'language_model.model.layers.22.input_layernorm.weight', 'language_model.model.layers.22.mlp.down_proj.weight', 'language_model.model.layers.22.mlp.gate_proj.weight', 'language_model.model.layers.22.mlp.up_proj.weight', 'language_model.model.layers.22.post_attention_layernorm.weight', 'language_model.model.layers.22.self_attn.k_proj.weight', 'language_model.model.layers.22.self_attn.o_proj.weight', 'language_model.model.layers.22.self_attn.q_proj.weight', 'language_model.model.layers.22.self_attn.v_proj.weight', 'language_model.model.layers.23.input_layernorm.weight', 'language_model.model.layers.23.mlp.down_proj.weight', 'language_model.model.layers.23.mlp.gate_proj.weight', 'language_model.model.layers.23.mlp.up_proj.weight', 'language_model.model.layers.23.post_attention_layernorm.weight', 'language_model.model.layers.23.self_attn.k_proj.weight', 'language_model.model.layers.23.self_attn.o_proj.weight', 'language_model.model.layers.23.self_attn.q_proj.weight', 'language_model.model.layers.23.self_attn.v_proj.weight', 'language_model.model.layers.24.input_layernorm.weight', 'language_model.model.layers.24.mlp.down_proj.weight', 'language_model.model.layers.24.mlp.gate_proj.weight', 'language_model.model.layers.24.mlp.up_proj.weight', 'language_model.model.layers.24.post_attention_layernorm.weight', 'language_model.model.layers.24.self_attn.k_proj.weight', 'language_model.model.layers.24.self_attn.o_proj.weight', 'language_model.model.layers.24.self_attn.q_proj.weight', 'language_model.model.layers.24.self_attn.v_proj.weight', 'language_model.model.layers.25.input_layernorm.weight', 'language_model.model.layers.25.mlp.down_proj.weight', 'language_model.model.layers.25.mlp.gate_proj.weight', 'language_model.model.layers.25.mlp.up_proj.weight', 'language_model.model.layers.25.post_attention_layernorm.weight', 'language_model.model.layers.25.self_attn.k_proj.weight', 'language_model.model.layers.25.self_attn.o_proj.weight', 'language_model.model.layers.25.self_attn.q_proj.weight', 'language_model.model.layers.25.self_attn.v_proj.weight', 'language_model.model.layers.26.input_layernorm.weight', 'language_model.model.layers.26.mlp.down_proj.weight', 'language_model.model.layers.26.mlp.gate_proj.weight', 'language_model.model.layers.26.mlp.up_proj.weight', 'language_model.model.layers.26.post_attention_layernorm.weight', 'language_model.model.layers.26.self_attn.k_proj.weight', 'language_model.model.layers.26.self_attn.o_proj.weight', 'language_model.model.layers.26.self_attn.q_proj.weight', 'language_model.model.layers.26.self_attn.v_proj.weight', 'language_model.model.layers.27.input_layernorm.weight', 'language_model.model.layers.27.mlp.down_proj.weight', 'language_model.model.layers.27.mlp.gate_proj.weight', 'language_model.model.layers.27.mlp.up_proj.weight', 'language_model.model.layers.27.post_attention_layernorm.weight', 'language_model.model.layers.27.self_attn.k_proj.weight', 'language_model.model.layers.27.self_attn.o_proj.weight', 'language_model.model.layers.27.self_attn.q_proj.weight', 'language_model.model.layers.27.self_attn.v_proj.weight', 'language_model.model.layers.28.input_layernorm.weight', 'language_model.model.layers.28.mlp.down_proj.weight', 'language_model.model.layers.28.mlp.gate_proj.weight', 'language_model.model.layers.28.mlp.up_proj.weight', 'language_model.model.layers.28.post_attention_layernorm.weight', 'language_model.model.layers.28.self_attn.k_proj.weight', 'language_model.model.layers.28.self_attn.o_proj.weight', 'language_model.model.layers.28.self_attn.q_proj.weight', 'language_model.model.layers.28.self_attn.v_proj.weight', 'language_model.model.layers.29.input_layernorm.weight', 'language_model.model.layers.29.mlp.down_proj.weight', 'language_model.model.layers.29.mlp.gate_proj.weight', 'language_model.model.layers.29.mlp.up_proj.weight', 'language_model.model.layers.29.post_attention_layernorm.weight', 'language_model.model.layers.29.self_attn.k_proj.weight', 'language_model.model.layers.29.self_attn.o_proj.weight', 'language_model.model.layers.29.self_attn.q_proj.weight', 'language_model.model.layers.29.self_attn.v_proj.weight', 'language_model.model.layers.3.input_layernorm.weight', 'language_model.model.layers.3.mlp.down_proj.weight', 'language_model.model.layers.3.mlp.gate_proj.weight', 'language_model.model.layers.3.mlp.up_proj.weight', 'language_model.model.layers.3.post_attention_layernorm.weight', 'language_model.model.layers.3.self_attn.k_proj.weight', 'language_model.model.layers.3.self_attn.o_proj.weight', 'language_model.model.layers.3.self_attn.q_proj.weight', 'language_model.model.layers.3.self_attn.v_proj.weight', 'language_model.model.layers.30.input_layernorm.weight', 'language_model.model.layers.30.mlp.down_proj.weight', 'language_model.model.layers.30.mlp.gate_proj.weight', 'language_model.model.layers.30.mlp.up_proj.weight', 'language_model.model.layers.30.post_attention_layernorm.weight', 'language_model.model.layers.30.self_attn.k_proj.weight', 'language_model.model.layers.30.self_attn.o_proj.weight', 'language_model.model.layers.30.self_attn.q_proj.weight', 'language_model.model.layers.30.self_attn.v_proj.weight', 'language_model.model.layers.31.input_layernorm.weight', 'language_model.model.layers.31.mlp.down_proj.weight', 'language_model.model.layers.31.mlp.gate_proj.weight', 'language_model.model.layers.31.mlp.up_proj.weight', 'language_model.model.layers.31.post_attention_layernorm.weight', 'language_model.model.layers.31.self_attn.k_proj.weight', 'language_model.model.layers.31.self_attn.o_proj.weight', 'language_model.model.layers.31.self_attn.q_proj.weight', 'language_model.model.layers.31.self_attn.v_proj.weight', 'language_model.model.layers.32.input_layernorm.weight', 'language_model.model.layers.32.mlp.down_proj.weight', 'language_model.model.layers.32.mlp.gate_proj.weight', 'language_model.model.layers.32.mlp.up_proj.weight', 'language_model.model.layers.32.post_attention_layernorm.weight', 'language_model.model.layers.32.self_attn.k_proj.weight', 'language_model.model.layers.32.self_attn.o_proj.weight', 'language_model.model.layers.32.self_attn.q_proj.weight', 'language_model.model.layers.32.self_attn.v_proj.weight', 'language_model.model.layers.33.input_layernorm.weight', 'language_model.model.layers.33.mlp.down_proj.weight', 'language_model.model.layers.33.mlp.gate_proj.weight', 'language_model.model.layers.33.mlp.up_proj.weight', 'language_model.model.layers.33.post_attention_layernorm.weight', 'language_model.model.layers.33.self_attn.k_proj.weight', 'language_model.model.layers.33.self_attn.o_proj.weight', 'language_model.model.layers.33.self_attn.q_proj.weight', 'language_model.model.layers.33.self_attn.v_proj.weight', 'language_model.model.layers.34.input_layernorm.weight', 'language_model.model.layers.34.mlp.down_proj.weight', 'language_model.model.layers.34.mlp.gate_proj.weight', 'language_model.model.layers.34.mlp.up_proj.weight', 'language_model.model.layers.34.post_attention_layernorm.weight', 'language_model.model.layers.34.self_attn.k_proj.weight', 'language_model.model.layers.34.self_attn.o_proj.weight', 'language_model.model.layers.34.self_attn.q_proj.weight', 'language_model.model.layers.34.self_attn.v_proj.weight', 'language_model.model.layers.35.input_layernorm.weight', 'language_model.model.layers.35.mlp.down_proj.weight', 'language_model.model.layers.35.mlp.gate_proj.weight', 'language_model.model.layers.35.mlp.up_proj.weight', 'language_model.model.layers.35.post_attention_layernorm.weight', 'language_model.model.layers.35.self_attn.k_proj.weight', 'language_model.model.layers.35.self_attn.o_proj.weight', 'language_model.model.layers.35.self_attn.q_proj.weight', 'language_model.model.layers.35.self_attn.v_proj.weight', 'language_model.model.layers.36.input_layernorm.weight', 'language_model.model.layers.36.mlp.down_proj.weight', 'language_model.model.layers.36.mlp.gate_proj.weight', 'language_model.model.layers.36.mlp.up_proj.weight', 'language_model.model.layers.36.post_attention_layernorm.weight', 'language_model.model.layers.36.self_attn.k_proj.weight', 'language_model.model.layers.36.self_attn.o_proj.weight', 'language_model.model.layers.36.self_attn.q_proj.weight', 'language_model.model.layers.36.self_attn.v_proj.weight', 'language_model.model.layers.37.input_layernorm.weight', 'language_model.model.layers.37.mlp.down_proj.weight', 'language_model.model.layers.37.mlp.gate_proj.weight', 'language_model.model.layers.37.mlp.up_proj.weight', 'language_model.model.layers.37.post_attention_layernorm.weight', 'language_model.model.layers.37.self_attn.k_proj.weight', 'language_model.model.layers.37.self_attn.o_proj.weight', 'language_model.model.layers.37.self_attn.q_proj.weight', 'language_model.model.layers.37.self_attn.v_proj.weight', 'language_model.model.layers.38.input_layernorm.weight', 'language_model.model.layers.38.mlp.down_proj.weight', 'language_model.model.layers.38.mlp.gate_proj.weight', 'language_model.model.layers.38.mlp.up_proj.weight', 'language_model.model.layers.38.post_attention_layernorm.weight', 'language_model.model.layers.38.self_attn.k_proj.weight', 'language_model.model.layers.38.self_attn.o_proj.weight', 'language_model.model.layers.38.self_attn.q_proj.weight', 'language_model.model.layers.38.self_attn.v_proj.weight', 'language_model.model.layers.39.input_layernorm.weight', 'language_model.model.layers.39.mlp.down_proj.weight', 'language_model.model.layers.39.mlp.gate_proj.weight', 'language_model.model.layers.39.mlp.up_proj.weight', 'language_model.model.layers.39.post_attention_layernorm.weight', 'language_model.model.layers.39.self_attn.k_proj.weight', 'language_model.model.layers.39.self_attn.o_proj.weight', 'language_model.model.layers.39.self_attn.q_proj.weight', 'language_model.model.layers.39.self_attn.v_proj.weight', 'language_model.model.layers.4.input_layernorm.weight', 'language_model.model.layers.4.mlp.down_proj.weight', 'language_model.model.layers.4.mlp.gate_proj.weight', 'language_model.model.layers.4.mlp.up_proj.weight', 'language_model.model.layers.4.post_attention_layernorm.weight', 'language_model.model.layers.4.self_attn.k_proj.weight', 'language_model.model.layers.4.self_attn.o_proj.weight', 'language_model.model.layers.4.self_attn.q_proj.weight', 'language_model.model.layers.4.self_attn.v_proj.weight', 'language_model.model.layers.5.input_layernorm.weight', 'language_model.model.layers.5.mlp.down_proj.weight', 'language_model.model.layers.5.mlp.gate_proj.weight', 'language_model.model.layers.5.mlp.up_proj.weight', 'language_model.model.layers.5.post_attention_layernorm.weight', 'language_model.model.layers.5.self_attn.k_proj.weight', 'language_model.model.layers.5.self_attn.o_proj.weight', 'language_model.model.layers.5.self_attn.q_proj.weight', 'language_model.model.layers.5.self_attn.v_proj.weight', 'language_model.model.layers.6.input_layernorm.weight', 'language_model.model.layers.6.mlp.down_proj.weight', 'language_model.model.layers.6.mlp.gate_proj.weight', 'language_model.model.layers.6.mlp.up_proj.weight', 'language_model.model.layers.6.post_attention_layernorm.weight', 'language_model.model.layers.6.self_attn.k_proj.weight', 'language_model.model.layers.6.self_attn.o_proj.weight', 'language_model.model.layers.6.self_attn.q_proj.weight', 'language_model.model.layers.6.self_attn.v_proj.weight', 'language_model.model.layers.7.input_layernorm.weight', 'language_model.model.layers.7.mlp.down_proj.weight', 'language_model.model.layers.7.mlp.gate_proj.weight', 'language_model.model.layers.7.mlp.up_proj.weight', 'language_model.model.layers.7.post_attention_layernorm.weight', 'language_model.model.layers.7.self_attn.k_proj.weight', 'language_model.model.layers.7.self_attn.o_proj.weight', 'language_model.model.layers.7.self_attn.q_proj.weight', 'language_model.model.layers.7.self_attn.v_proj.weight', 'language_model.model.layers.8.input_layernorm.weight', 'language_model.model.layers.8.mlp.down_proj.weight', 'language_model.model.layers.8.mlp.gate_proj.weight', 'language_model.model.layers.8.mlp.up_proj.weight', 'language_model.model.layers.8.post_attention_layernorm.weight', 'language_model.model.layers.8.self_attn.k_proj.weight', 'language_model.model.layers.8.self_attn.o_proj.weight', 'language_model.model.layers.8.self_attn.q_proj.weight', 'language_model.model.layers.8.self_attn.v_proj.weight', 'language_model.model.layers.9.input_layernorm.weight', 'language_model.model.layers.9.mlp.down_proj.weight', 'language_model.model.layers.9.mlp.gate_proj.weight', 'language_model.model.layers.9.mlp.up_proj.weight', 'language_model.model.layers.9.post_attention_layernorm.weight', 'language_model.model.layers.9.self_attn.k_proj.weight', 'language_model.model.layers.9.self_attn.o_proj.weight', 'language_model.model.layers.9.self_attn.q_proj.weight', 'language_model.model.layers.9.self_attn.v_proj.weight', 'language_model.model.norm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Model loaded
Loading processor from /home/PLLaVA/MODELS/pllava-13b
Processor loaded
Frame size: (1280, 720), Mode: RGB
Frame size: (1280, 720), Mode: RGB
Frame size: (1280, 720), Mode: RGB
Frame size: (1280, 720), Mode: RGB
Input video shape: 4 1280 720
img_list shape: [(1280, 720), (1280, 720), (1280, 720), (1280, 720)]
Uploaded 1 images for processing.
Generated Prompt: You are a powerful Video Magic ChatBot, a large vision-language assistant.
You are able to understand the video content that the user provides and assist the user in a video-language related task.
The user might provide you with the video and maybe some extra noisy information to help you out or ask you a question. Make use of the information in a proper way to be competent for the job.
INSTRUCTIONS:
Follow the user's instruction.
Be critical yet believe in yourself.
USER:
USER: USER: describe the video ASSISTANT:
Processed inputs keys: dict_keys(['pixel_values', 'input_ids', 'attention_mask'])
Pixel values shape: torch.Size([4, 3, 336, 336])
Pixel values: tensor([[[[-1.1061, -1.1061, -1.1353, ..., -0.2010, -0.2302, -0.2448],
[-1.1353, -1.1207, -1.1353, ..., -0.2594, -0.2740, -0.2594],
[-1.1353, -1.1353, -1.1353, ..., -0.0842, -0.2448, -0.2886],
...,
[-0.3908, -0.2302, -0.1280, ..., 0.0325, 0.0179, 0.0179],
[-0.3470, -0.1864, -0.0988, ..., 0.0325, 0.0179, 0.0179],
[-0.3178, -0.1426, -0.0550, ..., 0.0325, 0.0179, 0.0179]],
Input IDs shape: torch.Size([1, 140])
Input IDs: tensor([[ 1, 887, 526, 263, 13988, 13987, 26494, 678, 271, 29933,
327, 29892, 263, 2919, 18551, 29899, 11675, 20255, 29889, 29871,
13, 3492, 526, 2221, 304, 2274, 278, 4863, 2793, 393,
278, 1404, 8128, 322, 6985, 278, 1404, 297, 263, 4863,
29899, 11675, 4475, 3414, 29889, 13, 1576, 1404, 1795, 3867,
366, 411, 278, 4863, 322, 5505, 777, 4805, 694, 13344,
2472, 304, 1371, 366, 714, 470, 2244, 366, 263, 1139,
29889, 8561, 671, 310, 278, 2472, 297, 263, 1571, 982,
304, 367, 5100, 296, 363, 278, 4982, 29889, 13, 2277,
29937, 2672, 10810, 29965, 9838, 29903, 29901, 13, 29896, 29889,
10306, 278, 1404, 29915, 29879, 15278, 29889, 13, 29906, 29889,
1522, 12187, 3447, 4658, 297, 7535, 29889, 13, 3148, 1001,
29901, 29871, 32000, 29871, 13, 3148, 1001, 29901, 29871, 3148,
1001, 29901, 8453, 278, 4863, 319, 1799, 9047, 13566, 29901]])
###PROMPT: You are a powerful Video Magic ChatBot, a large vision-language assistant.
You are able to understand the video content that the user provides and assist the user in a video-language related task.
The user might provide you with the video and maybe some extra noisy information to help you out or ask you a question. Make use of the information in a proper way to be competent for the job.
INSTRUCTIONS:
USER:
USER: USER: describe the video ASSISTANT:
###LM OUTPUT TEXT You are a powerful Video Magic ChatBot, a large vision-language assistant.
You are able to understand the video content that the user provides and assist the user in a video-language related task.
The user might provide you with the video and maybe some extra noisy information to help you out or ask you a question. Make use of the information in a proper way to be competent for the job.
INSTRUCTIONS:
USER:
USER: USER: describe the video ASSISTANT:
Question: describe the video
Answer:
The text was updated successfully, but these errors were encountered: