Enable APM profiling for edxapp #749

Open
robrap opened this issue Jul 29, 2024 · 7 comments

@robrap (Contributor) commented Jul 29, 2024

Ultimately, we want to enable APM profiling for edxapp once we're confident it is safe to do so.

Notes:

  • We previously tried enabling APM Profiling and Data Streaming, but had to revert after a large latency spike in Production caused the auto-scaling group to scale up to its maximum trying to recover.
    • DD external Slack thread about the original issue.
      The 2U Slack thread could probably be tracked down if that would help, but I doubt it would, since we were just guessing at the time.
    • We thought this might be related to New Relic, but in November (with NR removed) we still had latency issues with profiling.
  • DD support ticket (private): https://help.datadoghq.com/hc/en-us/requests/1909564
  • We need to communicate once this is available.
  • Axim (Dave O and others) are interested in anything we might learn from this feature that could inform edx-platform performance improvements on the roadmap.
@robrap robrap added this to Arch-BOM Jul 29, 2024
@robrap robrap converted this from a draft issue Jul 29, 2024
@robrap robrap moved this to Backlog in Arch-BOM Jul 29, 2024
@robrap robrap changed the title Enable APM profiling for edxapp [Post-NR] Enable APM profiling for edxapp Aug 21, 2024
@robrap robrap changed the title [Post-NR] Enable APM profiling for edxapp Enable APM profiling for edxapp Aug 21, 2024
@robrap (Contributor, Author) commented Oct 16, 2024

We should roll out to Stage, then Edge, then Prod.

@jristau1984 jristau1984 moved this from Backlog to Ready For Development in Arch-BOM Oct 21, 2024
@dianakhuang dianakhuang self-assigned this Oct 21, 2024
@dianakhuang dianakhuang moved this from Ready For Development to In Progress in Arch-BOM Oct 21, 2024
@timmc-edx (Member) commented:

DD support ticket for latency issues we encountered during the most recent rollout attempt: https://help.datadoghq.com/hc/requests/1909564

dianakhuang added a commit to edx/configuration that referenced this issue Nov 12, 2024
It seems like the newer version might be more efficient,
so we should switch to using it.

edx/edx-arch-experiments#749
@timmc-edx (Member) commented Nov 25, 2024

I think I've managed to repro slow gunicorn startup on a sandbox instance.

Profiling setup

Added to /edx/app/edxapp/lms.sh and worker.sh (though the latter shouldn't matter for gunicorn):

export DD_PROFILING_ENABLED=true
export DD_PROFILING_STACK_V2_ENABLED=true
export DD_PROFILING_TIMELINE_ENABLED=true

And then:

/edx/bin/supervisorctl restart lms

(Can also restart workers with /edx/bin/supervisorctl restart edxapp_worker:lms_high_1 edxapp_worker:lms_high_mem_1 edxapp_worker:lms_default_1 but not needed for gunicorn experiment.)

To generate DD profiling data on both the web and worker sides, I pushed buttons in the instructor dashboard and made calls to https://timmc.sandbox.edx.org/heartbeat?extended -- data shows up under host:timmc (no env tag is set, unfortunately).

Gunicorn repro

In a dev terminal, make short HTTP calls to the LMS 1-2 times per second, and wait about 10 seconds before proceeding with data-gathering. This can be left running continuously and does not need to be stopped between tests:

while true; do curl -sS "https://timmc.sandbox.edx.org/heartbeat" -m1; sleep 0.5; done

For each config:

  • Edit the LMS configuration.
  • In a root sandbox terminal, restart the LMS with /edx/bin/supervisorctl restart lms and wait about 10 seconds.
  • For each iteration:
    • Restart LMS and pay attention to when the lms: started message appears to help orient yourself in the nginx logs.
    • In /edx/var/log/nginx/access.log, record the startup time as indicated in the Evaluation section below.
    • Wait about 30 seconds
  • Perform 3 iterations of this to get enough samples (depending on observed variance); a scripted sketch of this loop follows.
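
The loop above might look something like this as a script (just a sketch; assumes a root terminal and a hypothetical measure-startup.sh helper implementing the Evaluation step below):

for i in 1 2 3; do
    /edx/bin/supervisorctl restart lms
    sleep 30                  # covers startup plus the between-iteration wait
    ./measure-startup.sh      # hypothetical helper: prints the startup period from the nginx log
done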

nginx output will look something like this:

3.220.104.68 - - [25/Nov/2024:20:10:59 +0000] "GET /heartbeat HTTP/1.1" 200 122 0.021 "-" "curl/8.5.0" "-" - 0aff3657b83c7779b9a48d87ad185c60
3.220.104.68 - - [25/Nov/2024:20:11:00 +0000] "GET /heartbeat HTTP/1.1" 200 122 0.022 "-" "curl/8.5.0" "-" - 7f8c15ddced976c7199af3f53bde94d3
3.220.104.68 - - [25/Nov/2024:20:11:01 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.824 "-" "curl/8.5.0" "-" - 4d985f61d866cc4b470ac3dbd30b5d2d
3.220.104.68 - - [25/Nov/2024:20:11:03 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.879 "-" "curl/8.5.0" "-" - 173a7f0d30a37266976f45c36205fb37
3.220.104.68 - - [25/Nov/2024:20:11:04 +0000] "GET /heartbeat HTTP/1.1" 503 5416 0.506 "-" "curl/8.5.0" "-" - 0475736b183c4d12b8aaf294a94af859
3.220.104.68 - - [25/Nov/2024:20:11:05 +0000] "GET /heartbeat HTTP/1.1" 503 5416 0.028 "-" "curl/8.5.0" "-" - 0e4b1d04ac591ef74044156e14b11535
3.220.104.68 - - [25/Nov/2024:20:11:05 +0000] "GET /heartbeat HTTP/1.1" 503 5416 0.026 "-" "curl/8.5.0" "-" - d57d0b1f8d67ae64a1c46d53415d6d7b
3.220.104.68 - - [25/Nov/2024:20:11:07 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.842 "-" "curl/8.5.0" "-" - 87c9f84bc24ab93d19dd08b2f7ff3bc9
3.220.104.68 - - [25/Nov/2024:20:11:09 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.884 "-" "curl/8.5.0" "-" - 82a15a9f7e6feb753a0d40741236af89
3.220.104.68 - - [25/Nov/2024:20:11:10 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.860 "-" "curl/8.5.0" "-" - f389d7c7fd36ee272b1781f18d0f96da
3.220.104.68 - - [25/Nov/2024:20:11:12 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.842 "-" "curl/8.5.0" "-" - bb414172da1c9114cff47db64b4a9adb
3.220.104.68 - - [25/Nov/2024:20:11:13 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.848 "-" "curl/8.5.0" "-" - e8bc229b5746b14b11386db26be4e2ef
3.220.104.68 - - [25/Nov/2024:20:11:15 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.830 "-" "curl/8.5.0" "-" - 7a9170e8c2e447af974d1a6f21192d76
3.220.104.68 - - [25/Nov/2024:20:11:16 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.378 "-" "curl/8.5.0" "-" - b3fc5ba259efe44223fff8984d15e9df
3.220.104.68 - - [25/Nov/2024:20:11:18 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.829 "-" "curl/8.5.0" "-" - 066e305fe45a453c7ffc9f7dd5c67163
3.220.104.68 - - [25/Nov/2024:20:11:19 +0000] "GET /heartbeat HTTP/1.1" 200 122 0.207 "-" "curl/8.5.0" "-" - 35a8413f354269347e6a3dca170eb5f8
3.220.104.68 - - [25/Nov/2024:20:11:19 +0000] "GET /heartbeat HTTP/1.1" 200 122 0.019 "-" "curl/8.5.0" "-" - cb087a69b565042aa3dc6f648e4f4736

The initial transitions of 200 -> 499 and then 499 -> 503 occur during LMS shutdown. The 503 -> 499 transition co-occurs with the lms: started message from the supervisor, and 499 -> 200 is when curl starts getting responses again.

For comparison, here's /edx/var/log/supervisor/lms-stderr.log around that time period:

[2024-11-25 20:11:06 +0000] [973414] [INFO] Starting gunicorn 23.0.0
[2024-11-25 20:11:06 +0000] [973414] [INFO] Listening at: http://127.0.0.1:8000 (973414)
[2024-11-25 20:11:06 +0000] [973414] [INFO] Using worker: sync
[2024-11-25 20:11:06 +0000] [973422] [INFO] Booting worker with pid: 973422
[2024-11-25 20:11:06 +0000] [973426] [INFO] Booting worker with pid: 973426
[2024-11-25 20:11:16 +0000] [973426] [INFO] GET /heartbeat
[2024-11-25 20:11:17 +0000] [973422] [INFO] GET /heartbeat
[2024-11-25 20:11:18 +0000] [973426] [INFO] GET /heartbeat
[2024-11-25 20:11:18 +0000] [973426] [INFO] GET /heartbeat
[2024-11-25 20:11:18 +0000] [973426] [INFO] GET /heartbeat
[2024-11-25 20:11:18 +0000] [973426] [INFO] GET /heartbeat
[2024-11-25 20:11:18 +0000] [973426] [INFO] GET /heartbeat
[2024-11-25 20:11:19 +0000] [973426] [INFO] GET /heartbeat
[2024-11-25 20:11:19 +0000] [973426] [INFO] GET /heartbeat
[2024-11-25 20:11:19 +0000] [973422] [INFO] GET /heartbeat
[2024-11-25 20:11:20 +0000] [973422] [INFO] GET /heartbeat
[2024-11-25 20:11:21 +0000] [973422] [INFO] GET /heartbeat
[2024-11-25 20:11:22 +0000] [973422] [INFO] GET /heartbeat

In this sample, it appears that the calls recorded as 499s were eventually received by the LMS, and all of them were processed in a burst about 10 seconds after the workers actually started.

Evaluation

After the 503s end: Find the number of seconds from the first 499 to the first 200. This is the "startup period".
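
Here's a rough sketch of how that could be computed from the access log rather than by eyeball (a hypothetical helper, not something we're running; assumes gawk for mktime, and the log format shown above with the status code in field 9 and the timestamp in field 4):

awk '
  # Convert a timestamp like "[25/Nov/2024:20:11:07" into epoch seconds.
  function epoch(ts,    d, m) {
    split(substr(ts, 2), d, /[\/:]/)   # day, month-name, year, H, M, S
    m = (index("JanFebMarAprMayJunJulAugSepOctNovDec", d[2]) + 2) / 3
    return mktime(d[3] " " m " " d[1] " " d[4] " " d[5] " " d[6])
  }
  $9 == 503 { seen503 = 1; start = 0 }                   # shutdown phase; reset
  $9 == 499 && seen503 && !start { start = epoch($4) }   # first 499 after the 503s
  $9 == 200 && start {                                   # first 200 ends the window
    print "startup period:", epoch($4) - start, "seconds"
    exit
  }
' /edx/var/log/nginx/access.log

Run against the excerpt above, this would report the 12 seconds from 20:11:07 to 20:11:19.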

@timmc-edx (Member) commented Nov 25, 2024

Experiments

Profiling off

With profiling off (no profiling-related settings), the startup period lasts 12 seconds.

Repro

With the below profiling config, which is what we most recently used in the stage environment, the startup period lasts 20 seconds.

export DD_PROFILING_ENABLED=true
export DD_PROFILING_STACK_V2_ENABLED=true
export DD_PROFILING_TIMELINE_ENABLED=true

Baselines

Just with profiling enabled, nothing else:

export DD_PROFILING_ENABLED=true

19 seconds (with one 499 a few seconds after the first 200s); 18; 18


Profiling, but v2 stack:

export DD_PROFILING_ENABLED=true
export DD_PROFILING_STACK_V2_ENABLED=true

21; 22; 21

Experiment design

I'll keep this disabled for now since it's not needed for repro, and since we'll probably only want to use it when we want to actually look at the generated profiles:

  • DD_PROFILING_TIMELINE_ENABLED

To experiment with:

  • DD_PROFILING_API_TIMEOUT which defaults to 10 seconds -- try shortening this to 1 second.
  • DD_PROFILING_CAPTURE_PCT defaults to 1.0 -- try 0.1 or 10.
  • These default to true; try turning them off:
    • DD_PROFILING_ENABLE_CODE_PROVENANCE
    • DD_PROFILING_ENDPOINT_COLLECTION_ENABLED
    • DD_PROFILING_STACK_ENABLED
    • DD_PROFILING_MEMORY_ENABLED
    • DD_PROFILING_LOCK_ENABLED
    • DD_PROFILING_LOCK_NAME_INSPECT_DIR (only relevant when lock profiling is enabled)
    • DD_PROFILING_HEAP_ENABLED
  • These are off by default; try turning them on:
    • DD_PROFILING_STACK_V2_ENABLED (only relevant when stack profiling is enabled)
    • DD_PROFILING_EXPORT_LIBDD_ENABLED

@timmc-edx (Member) commented Nov 25, 2024

With a baseline of export DD_PROFILING_ENABLED=true and export DD_PROFILING_STACK_V2_ENABLED=true (since v2 is what DD wants everyone to switch to soon anyhow)...

  • DD_PROFILING_API_TIMEOUT=1: 19, 20, 21
  • DD_PROFILING_CAPTURE_PCT=0.1: 21, 19, 21

On to the toggles...


Turning every profiling feature off (except for profiling itself) gets us to the "good" situation:

export DD_PROFILING_STACK_ENABLED=false
export DD_PROFILING_MEMORY_ENABLED=false
export DD_PROFILING_HEAP_ENABLED=false
export DD_PROFILING_ENABLE_CODE_PROVENANCE=false
export DD_PROFILING_ENDPOINT_COLLECTION_ENABLED=false
export DD_PROFILING_LOCK_ENABLED=false

11, 9, 11


export DD_PROFILING_ENABLE_CODE_PROVENANCE=false
export DD_PROFILING_ENDPOINT_COLLECTION_ENABLED=false
export DD_PROFILING_LOCK_ENABLED=false

19, 19


export DD_PROFILING_STACK_ENABLED=false
export DD_PROFILING_MEMORY_ENABLED=false
export DD_PROFILING_HEAP_ENABLED=false

12, 11, 11


export DD_PROFILING_STACK_ENABLED=false

18, 17, 17


export DD_PROFILING_HEAP_ENABLED=false

16, 16, 16


export DD_PROFILING_MEMORY_ENABLED=false

14, 13, 13


export DD_PROFILING_MEMORY_ENABLED=false
export DD_PROFILING_HEAP_ENABLED=false

13, 15, 15

@timmc-edx (Member) commented:

More experiments...


Profiling on (but v2 stack not enabled), memory disabled:

export DD_PROFILING_ENABLED=true
export DD_PROFILING_MEMORY_ENABLED=false

13, 12, 13

@robrap robrap moved this from In Progress to Blocked in Arch-BOM Dec 2, 2024
@timmc-edx (Member) commented Dec 6, 2024

Also able to reproduce this on devstack.

Setup

  1. Check out timmc/datadog-local-testing in devstack and follow the usual devstack-datadog setup
  2. Adjust datadog/wrap-datadog.sh with profiling settings as below.
  3. Change datadog/lms-server.sh to use gunicorn instead of runserver:
    SERVICE_VARIANT=lms
    SERVICE_PORT=18000

    export DJANGO_SETTINGS_MODULE=${SERVICE_VARIANT}.envs.devstack
    gunicorn \
        -c /edx/app/edxapp/edx-platform/${SERVICE_VARIANT}/docker_${SERVICE_VARIANT}_gunicorn.py \
        --name ${SERVICE_VARIANT} \
        --bind=0.0.0.0:${SERVICE_PORT} \
        --max-requests=1000 \
        --access-logfile=- \
        ${SERVICE_VARIANT}.wsgi:application
    
  4. Change lms/docker_lms_gunicorn.py to use workers = 3

Measurements

Baseline:

[11, 15, 14, 14] seconds from "Booting worker" to first GET 200 in logs — 13.4 seconds geometric mean

Profiling enabled:

export DD_PROFILING_ENABLED=true
export DD_PROFILING_STACK_V2_ENABLED=true

[27, 32, 31, 28] — 29.4 seconds geometric mean

Profiling, but not memory:

export DD_PROFILING_ENABLED=true
export DD_PROFILING_STACK_V2_ENABLED=true
export DD_PROFILING_MEMORY_ENABLED=false

[19, 19, 20, 15, 20] — 18.5 seconds geometric mean
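
As a quick sanity check, the geometric means above can be recomputed with an awk one-liner (shown here for the last set of measurements):

echo 19 19 20 15 20 | awk '{ p = 1; for (i = 1; i <= NF; i++) p *= $i; printf "%.1f\n", p ^ (1 / NF) }'
# prints 18.5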
