[Bug]: JSON Decode Error #471

Closed
KylinMountain opened this issue Jul 10, 2024 · 5 comments
Labels: community_support (Issue handled by community members)

Comments

@KylinMountain
Contributor

Describe the bug

When I try to search using a global query, it reports json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0).

The response is also missing a paragraph: content that the model did not generate in JSON format gets dropped.
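For illustration, the failure mode (a hypothetical reply string, not an actual log): json.loads raises exactly this error when the response starts with prose instead of JSON.

import json

# Hypothetical model reply with a prose preamble before the JSON payload
reply = 'Here is the JSON you requested: {"points": []}'
json.loads(reply)  # json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)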

Steps to reproduce

Using gemma2-9b-it as the LLM, configured as shown in the GraphRAG config below.

poetry run poe index --root .
poetry run poe query --root . --method global "What is this story mainly tell"

Expected Behavior

The query should succeed without dropping content that the model did not format as JSON.

GraphRAG Config Used

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GROQ_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: gemma2-9b-it
  model_supports_json: false # recommended if this is available for your model.
  api_base: https://api.groq.com/openai/v1
  max_tokens: 2000
  concurrent_requests: 1 # the number of parallel inflight requests that may be made
  tokens_per_minute: 15000 # set a leaky bucket throttle
  requests_per_minute: 30 # set a leaky bucket throttle
  top_p: 0.99
  # request_timeout: 180.0
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  max_retries: 3
  max_retry_wait: 10
  sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times


parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-ada-002
    api_base: http://localhost:8080
    batch_size: 1 # the number of documents to send in a single request
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_reports:
  ## llm: override the global llm configuration
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  max_tokens: 2000
  top_p: 0.99
  temperature: 0
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32
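A note on model_supports_json: my understanding (an assumption, not verified against the graphrag source) is that setting it to true makes GraphRAG request the API's JSON mode, roughly equivalent to the sketch below. It is false here because I don't know whether Groq honors JSON mode for gemma2-9b-it.

import os
from openai import OpenAI

# Sketch only: response_format is the standard OpenAI JSON-mode parameter;
# whether Groq supports it for gemma2-9b-it is an assumption.
client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key=os.environ["GROQ_API_KEY"])
completion = client.chat.completions.create(
    model="gemma2-9b-it",
    messages=[{"role": "user", "content": "Reply with a JSON object."}],
    response_format={"type": "json_object"},
)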

Logs and screenshots

[two screenshots attached]

Additional Information

  • GraphRAG Version: latest main branch.
  • Operating System: Mac
  • Python Version: 3.10
  • Related Issues:
@KylinMountain added the bug and triage labels on Jul 10, 2024
@KylinMountain
Contributor Author

KylinMountain commented Jul 10, 2024

@AlonsoGuevara
Looking into the MAP_SYSTEM_PROMPT used by the global query, I think we should optimize it: the example in it is not clear, which may cause the model to not reliably generate a JSON reply. I will try to open a pull request for this.
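As a sketch of the kind of tightening I mean (hypothetical wording written by me, not the current prompt text; the points/description/score schema follows the global-search map response format):

# Hypothetical stricter instruction for MAP_SYSTEM_PROMPT
STRICT_JSON_INSTRUCTION = (
    "Respond with a single JSON object and nothing else: no prose before or "
    "after it, and no Markdown code fences.\n"
    'The object must have the shape {"points": [{"description": <string>, "score": <integer 0-100>}]}.\n'
    'Example: {"points": [{"description": "Company X is owned by Y", "score": 80}]}'
)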

@Ech0riginal

I initially had problems with the models outputting more than just JSON, such as unhelpful preambles or irrelevant information like "Here is the JSON you requested: ...". I modified the JSON scrubbing in llm/openai/_json.py to trim the string to the span between the first and last brackets found. It's likely error-prone on larger datasets and/or models other than the one I'm using, but it works well enough for what I need at the moment:

def clean_up_json(json_str: str) -> str:
    """Clean up json string."""
    # Strip literal and escaped newlines, plus stray quoting around arrays
    json_str = (
        json_str.replace("\\n", "")
        .replace("\n", "")
        .replace("\r", "")
        .replace('"[{', "[{")
        .replace('}]"', "}]")
        .replace("\\", "")
        .strip()
    )

    # Remove a JSON Markdown frame first, so the bracket scan below sees
    # the payload rather than the fence
    if json_str.startswith("```json"):
        json_str = json_str[len("```json") :]
    if json_str.endswith("```"):
        json_str = json_str[: -len("```")]

    # Constrain the string to the outermost braces regardless of any
    # surrounding prose; skip if the reply contains no braces at all
    open_brackets = list(__find_all(json_str, "{"))
    close_brackets = list(__find_all(json_str, "}"))
    if open_brackets and close_brackets:
        json_str = json_str[min(open_brackets) : max(close_brackets) + 1]

    return json_str

# https://stackoverflow.com/a/4665027
def __find_all(string: str, substring: str):
    """Yield the start index of every occurrence of substring in string."""
    start = 0
    while True:
        start = string.find(substring, start)
        if start == -1:
            return
        yield start
        start += len(substring)
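A quick usage example (a made-up reply string of mine) showing the trimming:

raw = 'Sure! Here you go: {"points": [{"description": "A", "score": 90}]} Hope that helps.'
print(clean_up_json(raw))
# -> {"points": [{"description": "A", "score": 90}]}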

@vv111y

vv111y commented Jul 12, 2024

I'm hitting this bug.

  • For the proposed PR, what do I need to know to use it?
  • Are there particular models that should be selected (aside from gpt-4)? (Related: "Share my config to change to your local LLM and embedding" #374)
  • Will the PR changes allow for graceful failure, so that output which is not quite valid JSON can still be used?
  • Perhaps I could do some checks and cleanup after receiving results (see the sketch below)? Just asking for some guidance, thanks.
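Something like this minimal sketch is what I have in mind (parse_or_salvage is a hypothetical helper of mine, not part of graphrag):

import json
import re

def parse_or_salvage(raw: str):
    """Try strict JSON parsing; fall back to the outermost {...} span."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Fallback: everything between the first "{" and the last "}"
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is not None:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            return None
    return None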

Update:
meta-llama/Llama-3-8b-chat-hf : causes this error.
Qwen/Qwen2-72B-Instruct : empty response, no error reported.

Should I open another issue?

@jaigouk

jaigouk commented Jul 14, 2024

I was able to use Claude 3 Haiku with my proxy (https://github.com/jaigouk/claude-proxy-api) on k3s with 3 replicas, and graphrag/query/structured_search/global_search/search.py needs to be improved as @Ech0riginal said.
I tested my setup and was able to get a response with global search.

#545

My settings.yaml is:

llm:
  api_key: ${CLAUDE_PROXY_API_KEY}
  type: openai_chat
  model_supports_json: true
  model: "claude-3-haiku-20240307"
  api_base: "http://192.168.8.213:30012/v1"
  # max_tokens: 10000 # Adjusted based on Claude 3 Haiku's typical context window
  request_timeout: 30
  tokens_per_minute: 100000
  requests_per_minute: 1000
  max_retry_wait: 5
  temperature: 0.1
  
embeddings:
  async_mode: threaded
  llm:
    api_key: ${EMBEDDING_API_KEY}
    type: openai_embedding
    model: "BAAI/bge-m3"
    api_base: "http://localhost:7997"

I am using https://github.com/michaelfeil/infinity for embeddings. With 8k text tokens, indexing takes about 1 minute; without model_supports_json: true it takes about 3 minutes.

@natoverse
Collaborator

Consolidating alternate model issues here: #657

@natoverse closed this as not planned (won't fix, can't repro, duplicate, stale) on Jul 23, 2024
@natoverse added the community_support label and removed the bug and triage labels on Jul 23, 2024