
[BUG]: intermittent segfault on rag pipeline #1291

Closed · dagardner-nv opened this issue Oct 18, 2023 · 1 comment
Assignees: dagardner-nv
Labels: bug (Something isn't working), sherlock (Issues/PRs related to Sherlock workflows and components)

dagardner-nv (Contributor) commented:

### Version

23.11

### Which installation method(s) does this occur on?

Source

### Describe the bug.

I've seen this intermittently on startup.

### Minimum reproducible example

```bash
python examples/llm/main.py --log_level=debug rag pipeline
```
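
Since the failure only occurs on some runs, it can take several attempts to reproduce. A small driver like the sketch below (purely illustrative, not part of the repo) re-runs the pipeline until a run exits abnormally:

```python
# Illustrative repro loop (assumed helper, not in the Morpheus repo): re-run the
# RAG pipeline example until a run exits with a non-zero status such as SIGSEGV.
import subprocess
import sys

CMD = ["python", "examples/llm/main.py", "--log_level=debug", "rag", "pipeline"]

for attempt in range(1, 51):  # arbitrary cap of 50 runs
    result = subprocess.run(CMD)
    if result.returncode != 0:
        # On Linux a segfault surfaces as a negative return code (-11 for SIGSEGV).
        print(f"run {attempt} exited with {result.returncode}", file=sys.stderr)
        sys.exit(1)
    print(f"run {attempt} completed cleanly")
```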


### Relevant log output

<details><summary>Click here to see error details</summary><pre>

====Registering Pipeline====
====Building Pipeline====
====Building Pipeline Complete!====            
Creating loop                                  
Starting! Time: 1697670092.2514627
Creating loop                                  
====Registering Pipeline Complete!====
Setting loop current                           
====Starting Pipeline====
Running forever                                
====Building Segment: linear_segment_0====
Source rate: 0 questions [00:00, ? questions/s]Setting loop current
Running forever
Added source: <from-mem-0; InMemorySourceStage(dataframes=[                                 questions
0  Tell me a story about your best friend.
1  Tell me a story about your best friend.
2  Tell me a story about your best friend.
3  Tell me a story about your best friend.
4  Tell me a story about your best friend.], repeat=10)>
  └─> morpheus.MessageMeta
Added stage: <deserialize-1; DeserializeStage(ensure_sliceable_index=True, message_type=<class 'morpheus._lib.messages.ControlMessage'>, task_type=llm_engine, task_payload={'task_type': 'completion', 'task_dict': {'input_keys': ['questions']}})>
  └─ morpheus.MessageMeta -> morpheus.ControlMessage
Added stage: <monitor-2; MonitorStage(description=Source rate, smoothing=0.05, unit=questions, delayed_start=False, determine_count_fn=None, log_level=LogLevels.INFO)>
  └─ morpheus.ControlMessage -> morpheus.ControlMessage
Added stage: <llm-engine-3; LLMEngineStage(engine=<morpheus._lib.llm.LLMEngine object at 0x7f00d8382a30>)>
  └─ morpheus.ControlMessage -> morpheus.ControlMessage
Added stage: <to-mem-4; InMemorySinkStage()>   
  └─ morpheus.ControlMessage -> morpheus.ControlMessage
Added stage: <monitor-5; MonitorStage(description=Upload rate, smoothing=0.05, unit=events, delayed_start=True, determine_count_fn=None, log_level=LogLevels.INFO)>
  └─ morpheus.ControlMessage -> morpheus.ControlMessage
====Building Segment Complete!====                  
====Pipeline Started====                            
Source rate: 5 questions [00:00, 387.74 questions/s]*** Aborted at 1697670092 (unix time) try "date -d @1697670092" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 331095 (TID 0x7efeb3fff6c0) from PID 0; stack trace: ***
    @     0x7f0217218197 google::(anonymous namespace)::FailureSignalHandler()
    @     0x7f02bda81fd0 (unknown)
    @     0x7f01a000cee2 boost::fibers::wait_queue::notify_one()
    @     0x7f01a000d53f boost::fibers::mutex::unlock()
    @     0x7f01512ee2d0 std::unique_lock<>::unlock()
    @     0x7f01512ee5de std::unique_lock<>::~unique_lock()
    @     0x7f01512dd53c boost::fibers::detail::shared_state_base::wait()
    @     0x7f01512fd45c boost::fibers::detail::future_base<>::wait()
    @     0x7f015136b07d _ZZN8morpheus3llm18BoostFutureAwaiterIFN3mrc7channel6StatusERSt10shared_ptrINS_14ControlMessageEEEE7Awaiter13await_suspendENSt7__n486116coroutine_handleIvEEENKUlSE_E_clESE_
    @     0x7f0151376170 _ZSt13__invoke_implIvRZN8morpheus3llm18BoostFutureAwaiterIFN3mrc7channel6StatusERSt10shared_ptrINS0_14ControlMessageEEEE7Awaiter13await_suspendENSt7__n486116coroutine_handleIvEEEUlSF_E_JSF_EET_St14__invoke_otherOT0_DpOT1_
    @     0x7f0151375a31 _ZSt8__invokeIRZN8morpheus3llm18BoostFutureAwaiterIFN3mrc7channel6StatusERSt10shared_ptrINS0_14ControlMessageEEEE7Awaiter13await_suspendENSt7__n486116coroutine_handleIvEEEUlSF_E_JSF_EENSt15__invoke_resultIT_JDpT0_EE4typeEOSJ_DpOSK_
    @     0x7f0151374639 _ZSt12__apply_implIRZN8morpheus3llm18BoostFutureAwaiterIFN3mrc7channel6StatusERSt10shared_ptrINS0_14ControlMessageEEEE7Awaiter13await_suspendENSt7__n486116coroutine_handleIvEEEUlSF_E_St5tupleIJSF_EEJLm0EEEDcOT_OT0_St16integer_sequenceImJXspT1_EEE
    @     0x7f0151374677 _ZSt5applyIRZN8morpheus3llm18BoostFutureAwaiterIFN3mrc7channel6StatusERSt10shared_ptrINS0_14ControlMessageEEEE7Awaiter13await_suspendENSt7__n486116coroutine_handleIvEEEUlSF_E_St5tupleIJSF_EEEDcOT_OT0_
    @     0x7f01513746d3 _ZN5boost6fibers6detail11task_objectIZN8morpheus3llm18BoostFutureAwaiterIFN3mrc7channel6StatusERSt10shared_ptrINS3_14ControlMessageEEEE7Awaiter13await_suspendENSt7__n486116coroutine_handleIvEEEUlSI_E_SaINS0_13packaged_taskIFvSI_EEEEvJSI_EE3runEOSI_
    @     0x7f015135759e boost::fibers::packaged_task<>::operator()()
    @     0x7f0151355c0e std::__invoke_impl<>()
    @     0x7f015135220a std::__invoke<>()
    @     0x7f015134d127 _ZSt12__apply_implIN5boost6fibers13packaged_taskIFvNSt7__n486116coroutine_handleIvEEEEESt5tupleIJS5_EEJLm0EEEDcOT_OT0_St16integer_sequenceImJXspT1_EEE
    @     0x7f015134d165 _ZSt5applyIN5boost6fibers13packaged_taskIFvNSt7__n486116coroutine_handleIvEEEEESt5tupleIJS5_EEEDcOT_OT0_
    @     0x7f015134d207 boost::fibers::worker_context<>::run_()
    @     0x7f015135c9e7 std::__invoke_impl<>()
    @     0x7f015135c70b std::__invoke<>()
    @     0x7f015135c287 _ZNSt5_BindIFMN5boost6fibers14worker_contextINS1_13packaged_taskIFvNSt7__n486116coroutine_handleIvEEEEEJS6_EEEFNS0_7context5fiberEOSB_EPS9_St12_PlaceholderILi1EEEE6__callISB_JSC_EJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
    @     0x7f015135bde6 std::_Bind<>::operator()<>()
    @     0x7f015135b50b std::__invoke_impl<>()
    @     0x7f015135a811 std::__invoke<>()
    @     0x7f01513585ef std::invoke<>()
    @     0x7f01513576c2 boost::context::detail::fiber_record<>::run()
    @     0x7f0151355e0a boost::context::detail::fiber_entry<>()
Source rate[Complete]: 50 questions [00:00, 481.87 questions/s]    @     0x7f0214edb11f make_fcontext
Segmentation fault

</pre></details>


### Full env printout

<details><summary>Click here to see environment details</summary><pre>

 [Paste the results of print_env.sh here, it will be hidden by default]

</pre></details>


### Other/Misc.

_No response_

### Code of Conduct

- [X] I agree to follow Morpheus' Code of Conduct
- [X] I have searched the [open bugs](https://github.com/nv-morpheus/Morpheus/issues?q=is%3Aopen+is%3Aissue+label%3Abug) and have found no duplicates for this bug report
dagardner-nv added the bug label Oct 18, 2023
dagardner-nv self-assigned this Oct 18, 2023
dagardner-nv added the sherlock label Oct 18, 2023
dagardner-nv added a commit to dagardner-nv/Morpheus that referenced this issue Oct 18, 2023
dagardner-nv (Contributor, Author) commented:

The fix for this is:

```diff
diff --git a/examples/llm/common/llm_engine_stage.py b/examples/llm/common/llm_engine_stage.py
index 810059ecd..b38e9d251 100644
--- a/examples/llm/common/llm_engine_stage.py
+++ b/examples/llm/common/llm_engine_stage.py
@@ -71,7 +71,7 @@ class LLMEngineStage(SinglePortStage):
     def _build_single(self, builder: mrc.Builder, input_stream: StreamPair) -> StreamPair:
 
         node = _llm.LLMEngineStage(builder, self.unique_name, self._engine)
-        node.launch_options.pe_count = 2
+        node.launch_options.pe_count = 1
 
         builder.make_edge(input_stream[0], node)
```
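
For context, here is a minimal sketch of what the surrounding method looks like after the change, reconstructed from the diff context above; anything not visible in the diff (such as the return value) is assumed and left out:

```python
# Sketch only: fragment of examples/llm/common/llm_engine_stage.py after the fix,
# reconstructed from the diff context above; the rest of the class is not shown here.
class LLMEngineStage(SinglePortStage):

    def _build_single(self, builder: mrc.Builder, input_stream: StreamPair) -> StreamPair:

        node = _llm.LLMEngineStage(builder, self.unique_name, self._engine)

        # A single progress engine avoids the concurrent fiber wake-up that the
        # stack trace above appears to implicate when pe_count was 2.
        node.launch_options.pe_count = 1

        builder.make_edge(input_stream[0], node)
        # (remainder of the method is unchanged and not visible in the diff)
```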

dagardner-nv reopened this Oct 19, 2023