❌ create_final_entities : Unable to run GraphRAG Pipeline #606

Closed
Jainil-Gosalia opened this issue Jul 18, 2024 · 2 comments
Labels
community_support Issue handled by community members

Comments

@Jainil-Gosalia

Describe the issue

I was trying to run GraphRAG using llama_cpp and got the following error:

❌ create_final_entities
⠼ GraphRAG Indexer
├── Loading Input (text) - 1 files loaded (0 filtered) ━━━━━━ 100% 0:00:… 0:00:…
├── create_base_text_units
├── create_base_extracted_entities
├── create_summarized_entities
├── create_base_entity_graph
None
⠴ GraphRAG Indexer
├── Loading Input (text) - 1 files loaded (0 filtered) ━━━━━━ 100% 0:00:… 0:00:…
├── create_base_text_units
├── create_base_extracted_entities
├── create_summarized_entities
├── create_base_entity_graph
⠴ GraphRAG Indexer
├── Loading Input (text) - 1 files loaded (0 filtered) ━━━━━━ 100% 0:00:… 0:00:…
├── create_base_text_units
├── create_base_extracted_entities
├── create_summarized_entities
├── create_base_entity_graph
└── create_final_entities
❌ Errors occurred during the pipeline run, see logs for more details.

Steps to reproduce

Run the indexer with the settings.yaml file shown in the next section to reproduce the issue.

GraphRAG Config Used

The settings.yaml is as follows:

encoding_model: cl100k_base
skip_workflows: []
llm:
api_key: ${GRAPHRAG_API_KEY}
type: openai_chat # or azure_openai_chat
model: mistral

model_supports_json: false # recommended if this is available for your model.

max_tokens: 4000

request_timeout: 180.0

api_base: http://localhost:8000/v1

api_version: 2024-02-15-preview

organization: <organization_id>

deployment_name: <azure_model_deployment_name>

tokens_per_minute: 150_000 # set a leaky bucket throttle

requests_per_minute: 10_000 # set a leaky bucket throttle

max_retries: 10

max_retry_wait: 10.0

sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times

concurrent_requests: 1 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: mistral
    api_base: http://localhost:8000/v1
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    batch_size: 1 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32

Logs and screenshots

The indexing engine log file shows this:

04:36:19,87 datashaper.workflow.workflow ERROR Error executing verb "text_embed" in create_final_entities: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (16,) + inhomogeneous part.
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/datashaper/workflow/workflow.py", line 415, in _execute_verb
result = await result
File "/usr/local/lib/python3.10/dist-packages/graphrag/index/verbs/text/embed/text_embed.py", line 105, in text_embed
return await _text_embed_in_memory(
File "/usr/local/lib/python3.10/dist-packages/graphrag/index/verbs/text/embed/text_embed.py", line 130, in _text_embed_in_memory
result = await strategy_exec(texts, callbacks, cache, strategy_args)
File "/usr/local/lib/python3.10/dist-packages/graphrag/index/verbs/text/embed/strategies/openai.py", line 61, in run
embeddings = await _execute(llm, text_batches, ticker, semaphore)
File "/usr/local/lib/python3.10/dist-packages/graphrag/index/verbs/text/embed/strategies/openai.py", line 105, in _execute
results = await asyncio.gather(*futures)
File "/usr/local/lib/python3.10/dist-packages/graphrag/index/verbs/text/embed/strategies/openai.py", line 100, in embed
result = np.array(chunk_embeddings.output)
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (16,) + inhomogeneous part.
04:36:19,92 graphrag.index.reporting.file_workflow_callbacks INFO Error executing verb "text_embed" in create_final_entities: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (16,) + inhomogeneous part. details=None
04:36:19,96 graphrag.index.run ERROR error running workflow create_final_entities
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/graphrag/index/run.py", line 323, in run_pipeline
result = await workflow.run(context, callbacks)
File "/usr/local/lib/python3.10/dist-packages/datashaper/workflow/workflow.py", line 369, in run
timing = await self._execute_verb(node, context, callbacks)
File "/usr/local/lib/python3.10/dist-packages/datashaper/workflow/workflow.py", line 415, in _execute_verb
result = await result
File "/usr/local/lib/python3.10/dist-packages/graphrag/index/verbs/text/embed/text_embed.py", line 105, in text_embed
return await _text_embed_in_memory(
File "/usr/local/lib/python3.10/dist-packages/graphrag/index/verbs/text/embed/text_embed.py", line 130, in _text_embed_in_memory
result = await strategy_exec(texts, callbacks, cache, strategy_args)
File "/usr/local/lib/python3.10/dist-packages/graphrag/index/verbs/text/embed/strategies/openai.py", line 61, in run
embeddings = await _execute(llm, text_batches, ticker, semaphore)
File "/usr/local/lib/python3.10/dist-packages/graphrag/index/verbs/text/embed/strategies/openai.py", line 105, in _execute
results = await asyncio.gather(*futures)
File "/usr/local/lib/python3.10/dist-packages/graphrag/index/verbs/text/embed/strategies/openai.py", line 100, in embed
result = np.array(chunk_embeddings.output)
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (16,) + inhomogeneous part.
04:36:19,97 graphrag.index.reporting.file_workflow_callbacks INFO Error running pipeline! details=None
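
For readers unfamiliar with this ValueError: NumPy raises it when np.array is given an outer sequence (here, 16 items) whose elements do not all share the same shape, so the items cannot be stacked into a rectangular array. The following is a minimal sketch, not GraphRAG code, of how one might check a batch for that condition before stacking. The helper name length_histogram and the sample batch are hypothetical, and the sketch uses only the standard library.

```python
from collections import Counter


def length_histogram(rows):
    """Count how many rows have each length.

    More than one distinct length means the batch is ragged and
    cannot be stacked into a single rectangular array.
    """
    return Counter(len(row) for row in rows)


# Hypothetical batch: 15 vectors of length 4 and one of length 3.
batch = [[0.0] * 4 for _ in range(15)] + [[0.0] * 3]

hist = length_histogram(batch)
print(dict(hist))     # {4: 15, 3: 1}
print(len(hist) > 1)  # True, so the batch is ragged
```

Running a check like this on the raw embedding output, before it is handed to NumPy, would show which rows differ in length.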

The logs.json file shows this:

{"type": "error", "data": "Error executing verb "text_embed" in create_final_entities: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (16,) + inhomogeneous part.", "stack": "Traceback (most recent call last):\n File "/usr/local/lib/python3.10/dist-packages/datashaper/workflow/workflow.py", line 415, in _execute_verb\n result = await result\n File "/usr/local/lib/python3.10/dist-packages/graphrag/index/verbs/text/embed/text_embed.py", line 105, in text_embed\n return await _text_embed_in_memory(\n File "/usr/local/lib/python3.10/dist-packages/graphrag/index/verbs/text/embed/text_embed.py", line 130, in _text_embed_in_memory\n result = await strategy_exec(texts, callbacks, cache, strategy_args)\n File "/usr/local/lib/python3.10/dist-packages/graphrag/index/verbs/text/embed/strategies/openai.py", line 61, in run\n embeddings = await _execute(llm, text_batches, ticker, semaphore)\n File "/usr/local/lib/python3.10/dist-packages/graphrag/index/verbs/text/embed/strategies/openai.py", line 105, in _execute\n results = await asyncio.gather(*futures)\n File "/usr/local/lib/python3.10/dist-packages/graphrag/index/verbs/text/embed/strategies/openai.py", line 100, in embed\n result = np.array(chunk_embeddings.output)\nValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (16,) + inhomogeneous part.\n", "source": "setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (16,) + inhomogeneous part.", "details": null}

{"type": "error", "data": "Error running pipeline!", "stack": "Traceback (most recent call last):\n File "/usr/local/lib/python3.10/dist-packages/graphrag/index/run.py", line 323, in run_pipeline\n result = await workflow.run(context, callbacks)\n File "/usr/local/lib/python3.10/dist-packages/datashaper/workflow/workflow.py", line 369, in run\n timing = await self._execute_verb(node, context, callbacks)\n File "/usr/local/lib/python3.10/dist-packages/datashaper/workflow/workflow.py", line 415, in _execute_verb\n result = await result\n File "/usr/local/lib/python3.10/dist-packages/graphrag/index/verbs/text/embed/text_embed.py", line 105, in text_embed\n return await _text_embed_in_memory(\n File "/usr/local/lib/python3.10/dist-packages/graphrag/index/verbs/text/embed/text_embed.py", line 130, in _text_embed_in_memory\n result = await strategy_exec(texts, callbacks, cache, strategy_args)\n File "/usr/local/lib/python3.10/dist-packages/graphrag/index/verbs/text/embed/strategies/openai.py", line 61, in run\n embeddings = await _execute(llm, text_batches, ticker, semaphore)\n File "/usr/local/lib/python3.10/dist-packages/graphrag/index/verbs/text/embed/strategies/openai.py", line 105, in _execute\n results = await asyncio.gather(*futures)\n File "/usr/local/lib/python3.10/dist-packages/graphrag/index/verbs/text/embed/strategies/openai.py", line 100, in embed\n result = np.array(chunk_embeddings.output)\nValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (16,) + inhomogeneous part.\n", "source": "setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (16,) + inhomogeneous part.", "details": null}

Additional Information

  • GraphRAG Version: v0.1.1
  • Operating System: Ubuntu 20.04
  • Python Version: 3.10.14
  • Related Issues: 442
@Jainil-Gosalia added the triage label Jul 18, 2024
@rushizirpe

In your settings.yaml, you are using mistral as the embedding model, and that might be what causes the inhomogeneous shape error. Try a dedicated embedding model instead, for example one from nomic-ai or mixedbread.
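
One way to test this diagnosis is to inspect what the local server returns from its embeddings endpoint. The sketch below assumes the response has already been parsed into a dict in the OpenAI embeddings format (a "data" list whose items carry an "embedding" field). The helper names and sample payloads are hypothetical, and whether llama_cpp actually returns nested per-token vectors for mistral is an assumption that the thread does not confirm.

```python
def embedding_shapes(response):
    """Return the shape of each embedding in an OpenAI-style response dict.

    A flat vector of floats yields (dim,); a nested per-token result
    yields (n_tokens, dim).
    """
    shapes = []
    for item in response["data"]:
        emb = item["embedding"]
        if emb and isinstance(emb[0], list):
            shapes.append((len(emb), len(emb[0])))
        else:
            shapes.append((len(emb),))
    return shapes


def is_stackable(response):
    """True if every embedding is a flat vector and all share one length."""
    shapes = embedding_shapes(response)
    return all(len(s) == 1 for s in shapes) and len(set(shapes)) <= 1


# Hypothetical payloads, for illustration only.
pooled = {"data": [
    {"index": 0, "embedding": [0.1, 0.2, 0.3]},
    {"index": 1, "embedding": [0.4, 0.5, 0.6]},
]}
per_token = {"data": [
    {"index": 0, "embedding": [[0.1, 0.2], [0.3, 0.4]]},
    {"index": 1, "embedding": [[0.5, 0.6]]},
]}

print(is_stackable(pooled))     # True
print(is_stackable(per_token))  # False
```

If the check fails for the model in use, the batch cannot be stacked into a rectangular array, which is consistent with the error in the log.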

When I faced the issue, I created a repository for deploying Hugging Face models to local endpoints, offering functionality similar to OpenAI APIs. You can find the repo here: https://github.com/rushizirpe/open-llm-server

Also, I've prepared a Colab notebook with a GraphRAG demo that you might want to take a look at: https://colab.research.google.com/drive/1uhFDnih1WKrSRQHisU-L6xw6coapgR51?usp=sharing.
If you don't have access to GPUs like the A100, you'll need a GROQ_API_KEY, which is free with certain limitations. You can obtain one from: https://console.groq.com/keys

@natoverse
Collaborator

Consolidating alternate model issues here: #657

@natoverse closed this as not planned Jul 22, 2024
@natoverse added the community_support label and removed the triage label Jul 22, 2024