Enable APM profiling for edxapp #749

robrap · 2024-07-29T16:51:28Z

Ultimately, we want to enable APM profiling for edxapp, when we think it is safe.

Notes:

We previously tried enabling APM Profiling and Data Streaming but had to revert when we got a large latency issue in Production and the auto-scaling group scaled up to max to try to recover.
- DD external Slack thread about original issue.
  The 2U Slack thread could probably be tracked down if it would be helpful, but I'm guessing it won't be, because we were just guessing at the time.
- We thought this might be related to New Relic, but in November (with NR removed) we still had latency issues with profiling.
DD support ticket (private): https://help.datadoghq.com/hc/en-us/requests/1909564
We need to communicate once this is available.
Axim (Dave O and others) are interested in anything we might learn from this feature for edx-platform performance improvements that might go on the roadmap.

robrap · 2024-10-16T19:17:31Z

We should roll out to Stage, then Edge, then Prod.

timmc-edx · 2024-10-30T21:49:41Z

DD support ticket for latency issues we encountered during the most recent rollout attempt: https://help.datadoghq.com/hc/requests/1909564

timmc-edx · 2024-11-25T20:17:34Z

I think I've managed to repro slow gunicorn startup on a sandbox instance.

Profiling setup

Added to /edx/app/edxapp/lms.sh and worker.sh (though the latter shouldn't matter for gunicorn):

export DD_PROFILING_ENABLED=true
export DD_PROFILING_STACK_V2_ENABLED=true
export DD_PROFILING_TIMELINE_ENABLED=true

And then:

/edx/bin/supervisorctl restart lms

(Can also restart workers with /edx/bin/supervisorctl restart edxapp_worker:lms_high_1 edxapp_worker:lms_high_mem_1 edxapp_worker:lms_default_1 but not needed for gunicorn experiment.)

To get DD profiling data on both sides, pushed buttons in instructor dashboard and made calls to https://timmc.sandbox.edx.org/heartbeat?extended -- data shows up under host:timmc (no env tag is set, unfortunately).

Gunicorn repro

In a dev terminal, make short HTTP calls to the LMS 1-2 times per second:

while true; do curl -sS "https://timmc.sandbox.edx.org/heartbeat" -m1; sleep 0.5; done

Wait about 10 seconds before proceeding with data-gathering. This loop can be left on continuously and does not need to be stopped between tests.

For each config:

- Edit the LMS configuration.
- In a root sandbox terminal, restart the LMS with /edx/bin/supervisorctl restart lms and wait about 10 seconds.
- For each iteration:
  - Restart LMS and pay attention to when the lms: started message appears to help orient yourself in the nginx logs.
  - In /edx/var/log/nginx/access.log, record the startup time as indicated in the Evaluation section below.
  - Wait about 30 seconds.
- Perform 3 iterations of this to get enough samples (depending on observed variance).

nginx output will look something like this:

3.220.104.68 - - [25/Nov/2024:20:10:59 +0000] "GET /heartbeat HTTP/1.1" 200 122 0.021 "-" "curl/8.5.0" "-" - 0aff3657b83c7779b9a48d87ad185c60
3.220.104.68 - - [25/Nov/2024:20:11:00 +0000] "GET /heartbeat HTTP/1.1" 200 122 0.022 "-" "curl/8.5.0" "-" - 7f8c15ddced976c7199af3f53bde94d3
3.220.104.68 - - [25/Nov/2024:20:11:01 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.824 "-" "curl/8.5.0" "-" - 4d985f61d866cc4b470ac3dbd30b5d2d
3.220.104.68 - - [25/Nov/2024:20:11:03 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.879 "-" "curl/8.5.0" "-" - 173a7f0d30a37266976f45c36205fb37
3.220.104.68 - - [25/Nov/2024:20:11:04 +0000] "GET /heartbeat HTTP/1.1" 503 5416 0.506 "-" "curl/8.5.0" "-" - 0475736b183c4d12b8aaf294a94af859
3.220.104.68 - - [25/Nov/2024:20:11:05 +0000] "GET /heartbeat HTTP/1.1" 503 5416 0.028 "-" "curl/8.5.0" "-" - 0e4b1d04ac591ef74044156e14b11535
3.220.104.68 - - [25/Nov/2024:20:11:05 +0000] "GET /heartbeat HTTP/1.1" 503 5416 0.026 "-" "curl/8.5.0" "-" - d57d0b1f8d67ae64a1c46d53415d6d7b
3.220.104.68 - - [25/Nov/2024:20:11:07 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.842 "-" "curl/8.5.0" "-" - 87c9f84bc24ab93d19dd08b2f7ff3bc9
3.220.104.68 - - [25/Nov/2024:20:11:09 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.884 "-" "curl/8.5.0" "-" - 82a15a9f7e6feb753a0d40741236af89
3.220.104.68 - - [25/Nov/2024:20:11:10 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.860 "-" "curl/8.5.0" "-" - f389d7c7fd36ee272b1781f18d0f96da
3.220.104.68 - - [25/Nov/2024:20:11:12 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.842 "-" "curl/8.5.0" "-" - bb414172da1c9114cff47db64b4a9adb
3.220.104.68 - - [25/Nov/2024:20:11:13 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.848 "-" "curl/8.5.0" "-" - e8bc229b5746b14b11386db26be4e2ef
3.220.104.68 - - [25/Nov/2024:20:11:15 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.830 "-" "curl/8.5.0" "-" - 7a9170e8c2e447af974d1a6f21192d76
3.220.104.68 - - [25/Nov/2024:20:11:16 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.378 "-" "curl/8.5.0" "-" - b3fc5ba259efe44223fff8984d15e9df
3.220.104.68 - - [25/Nov/2024:20:11:18 +0000] "GET /heartbeat HTTP/1.1" 499 0 0.829 "-" "curl/8.5.0" "-" - 066e305fe45a453c7ffc9f7dd5c67163
3.220.104.68 - - [25/Nov/2024:20:11:19 +0000] "GET /heartbeat HTTP/1.1" 200 122 0.207 "-" "curl/8.5.0" "-" - 35a8413f354269347e6a3dca170eb5f8
3.220.104.68 - - [25/Nov/2024:20:11:19 +0000] "GET /heartbeat HTTP/1.1" 200 122 0.019 "-" "curl/8.5.0" "-" - cb087a69b565042aa3dc6f648e4f4736

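The initial transitions of 200 -> 499 and then 499 -> 503 occur during LMS shutdown. The 503 -> 499 transition coincides with the lms: started message from the supervisor, and 499 -> 200 is when curl starts getting responses again. (499 is nginx's status for a request the client closed before a response was sent, which here would be curl hitting its -m1 timeout.)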

For comparison, here's /edx/var/log/supervisor/lms-stderr.log around that time period:

[2024-11-25 20:11:06 +0000] [973414] [INFO] Starting gunicorn 23.0.0
[2024-11-25 20:11:06 +0000] [973414] [INFO] Listening at: http://127.0.0.1:8000 (973414)
[2024-11-25 20:11:06 +0000] [973414] [INFO] Using worker: sync
[2024-11-25 20:11:06 +0000] [973422] [INFO] Booting worker with pid: 973422
[2024-11-25 20:11:06 +0000] [973426] [INFO] Booting worker with pid: 973426
[2024-11-25 20:11:16 +0000] [973426] [INFO] GET /heartbeat
[2024-11-25 20:11:17 +0000] [973422] [INFO] GET /heartbeat
[2024-11-25 20:11:18 +0000] [973426] [INFO] GET /heartbeat
[2024-11-25 20:11:18 +0000] [973426] [INFO] GET /heartbeat
[2024-11-25 20:11:18 +0000] [973426] [INFO] GET /heartbeat
[2024-11-25 20:11:18 +0000] [973426] [INFO] GET /heartbeat
[2024-11-25 20:11:18 +0000] [973426] [INFO] GET /heartbeat
[2024-11-25 20:11:19 +0000] [973426] [INFO] GET /heartbeat
[2024-11-25 20:11:19 +0000] [973426] [INFO] GET /heartbeat
[2024-11-25 20:11:19 +0000] [973422] [INFO] GET /heartbeat
[2024-11-25 20:11:20 +0000] [973422] [INFO] GET /heartbeat
[2024-11-25 20:11:21 +0000] [973422] [INFO] GET /heartbeat
[2024-11-25 20:11:22 +0000] [973422] [INFO] GET /heartbeat

In this sample, it appears that those calls that were recorded as a 499 did eventually get received by the LMS and were all processed in a burst about 10 seconds after workers actually started.
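As a rough cross-check of that gap, the delay between the first "Booting worker" line and the first logged request can be pulled out of the supervisor log with a bit of awk. This is only a sketch: the function name boot_to_first_get is made up for illustration, and it assumes the line layout shown above (time of day in the second whitespace-separated field, request method in the sixth) and a window that does not cross midnight.

```shell
# Hypothetical helper: seconds from the first "Booting worker" line to the
# first logged GET in a gunicorn stderr log shaped like the sample above.
# Assumes field 2 is HH:MM:SS and that the window does not cross midnight.
boot_to_first_get() {
  awk '
    function secs(t,   p) { split(t, p, ":"); return p[1] * 3600 + p[2] * 60 + p[3] }
    /Booting worker/ && !booted { boot = secs($2); booted = 1; next }
    booted && !found && $6 == "GET" { print secs($2) - boot; found = 1 }
  ' "$@"
}
```

It reads a file argument or stdin, e.g. boot_to_first_get /edx/var/log/supervisor/lms-stderr.log, and only reports the first boot it sees, so the log would need to be trimmed to the restart of interest. On the sample above it gives 10.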

Evaluation

After the 503s end: Find the number of seconds from the first 499 to the first 200. This is the "startup period".
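The same rule can be applied mechanically to the access log. The sketch below is not what was used for the numbers in this thread (those were read off by eye); startup_period is a hypothetical name, and it assumes the nginx log format shown above (timestamp in field 4, status in field 9) and that a restart does not cross midnight.

```shell
# Hypothetical helper: for each restart in an nginx access log shaped like
# the sample above, print the seconds from the first 499 after the 503s end
# to the first 200 that follows.
# Assumes field 4 is "[dd/Mon/yyyy:HH:MM:SS" and field 9 is the status code.
startup_period() {
  awk '
    function secs(ts,   p) { split(ts, p, ":"); return p[2] * 3600 + p[3] * 60 + p[4] }
    $9 == 503               { state = 1; next }                   # still shutting down
    state == 1 && $9 == 499 { start = secs($4); state = 2; next } # first 499 after the 503s
    state == 1 && $9 == 200 { state = 0; next }                   # came back without any 499s
    state == 2 && $9 == 200 { print secs($4) - start; state = 0 } # first 200: startup is over
  ' "$@"
}
```

Run as startup_period /edx/var/log/nginx/access.log (or pipe lines in on stdin). For the sample output above, the first 499 after the 503s is at 20:11:07 and the first 200 is at 20:11:19, so it prints 12.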

timmc-edx · 2024-11-25T20:56:07Z

Experiments

Profiling off

With profiling off (no profiling-related settings), the startup period lasts 12 seconds.

Repro

With the below profiling config, which is what we most recently used in the stage environment, the startup period lasts 20 seconds.

export DD_PROFILING_ENABLED=true
export DD_PROFILING_STACK_V2_ENABLED=true
export DD_PROFILING_TIMELINE_ENABLED=true

Baselines

Just with profiling enabled, nothing else:

export DD_PROFILING_ENABLED=true

19 seconds (with one 499 a few seconds after the first 200s); 18; 18

Profiling, but with the v2 stack:

export DD_PROFILING_ENABLED=true
export DD_PROFILING_STACK_V2_ENABLED=true

21; 22; 21

Experiment design

I'll keep this disabled for now since it's not needed for repro, and since we'll probably only want to use it when we want to actually look at the generated profiles:

DD_PROFILING_TIMELINE_ENABLED

To experiment with:

DD_PROFILING_API_TIMEOUT which defaults to 10 seconds -- try shortening this to 1 second.
DD_PROFILING_CAPTURE_PCT defaults to 1.0 -- try 0.1 or 10.
These default to true; try turning them off:
- DD_PROFILING_ENABLE_CODE_PROVENANCE
- DD_PROFILING_ENDPOINT_COLLECTION_ENABLED
- DD_PROFILING_STACK_ENABLED
- DD_PROFILING_MEMORY_ENABLED
- DD_PROFILING_LOCK_ENABLED
- DD_PROFILING_LOCK_NAME_INSPECT_DIR (only relevant when lock profiling is enabled)
- DD_PROFILING_HEAP_ENABLED
These are off by default; try turning them on:
- DD_PROFILING_STACK_V2_ENABLED (only relevant when stack profiling is enabled)
- DD_PROFILING_EXPORT_LIBDD_ENABLED

timmc-edx · 2024-11-25T21:17:11Z

With a baseline of export DD_PROFILING_ENABLED=true and export DD_PROFILING_STACK_V2_ENABLED=true (since v2 is what DD wants everyone to switch to soon anyhow)...

DD_PROFILING_API_TIMEOUT=1: 19, 20, 21
DD_PROFILING_CAPTURE_PCT=0.1: 21, 19, 21

On to the toggles...

Turning every profiling feature off (except for profiling itself) gets to the "good" situation:

export DD_PROFILING_STACK_ENABLED=false
export DD_PROFILING_MEMORY_ENABLED=false
export DD_PROFILING_HEAP_ENABLED=false
export DD_PROFILING_ENABLE_CODE_PROVENANCE=false
export DD_PROFILING_ENDPOINT_COLLECTION_ENABLED=false
export DD_PROFILING_LOCK_ENABLED=false

11, 9, 11

export DD_PROFILING_ENABLE_CODE_PROVENANCE=false
export DD_PROFILING_ENDPOINT_COLLECTION_ENABLED=false
export DD_PROFILING_LOCK_ENABLED=false

19, 19

export DD_PROFILING_STACK_ENABLED=false
export DD_PROFILING_MEMORY_ENABLED=false
export DD_PROFILING_HEAP_ENABLED=false

12, 11, 11

export DD_PROFILING_STACK_ENABLED=false

18, 17, 17

export DD_PROFILING_HEAP_ENABLED=false

16, 16, 16

export DD_PROFILING_MEMORY_ENABLED=false

14, 13, 13

export DD_PROFILING_MEMORY_ENABLED=false
export DD_PROFILING_HEAP_ENABLED=false

13, 15, 15

timmc-edx · 2024-11-26T16:12:55Z

More experiments...

Profiling on (but v2 stack not enabled), memory disabled:

export DD_PROFILING_ENABLED=true
export DD_PROFILING_MEMORY_ENABLED=false

13, 12, 13

timmc-edx · 2024-12-06T15:32:21Z

Also able to reproduce this on devstack.

Setup

Check out timmc/datadog-local-testing in devstack and follow the usual devstack-datadog setup
Adjust datadog/wrap-datadog.sh with profiling settings as below.

Change datadog/lms-server.sh to use gunicorn instead of runserver:

SERVICE_VARIANT=lms
SERVICE_PORT=18000

export DJANGO_SETTINGS_MODULE=${SERVICE_VARIANT}.envs.devstack
gunicorn \
    -c /edx/app/edxapp/edx-platform/${SERVICE_VARIANT}/docker_${SERVICE_VARIANT}_gunicorn.py \
    --name ${SERVICE_VARIANT} \
    --bind=0.0.0.0:${SERVICE_PORT} \
    --max-requests=1000 \
    --access-logfile - \
    ${SERVICE_VARIANT}.wsgi:application

Change lms/docker_lms_gunicorn.py to use workers = 3

Measurements

Baseline:

[11, 15, 14, 14] seconds from "Booting worker" to first GET 200 in logs — 13.4 seconds geometric mean

Profiling enabled:

export DD_PROFILING_ENABLED=true
export DD_PROFILING_STACK_V2_ENABLED=true

[27, 32, 31, 28] — 29.4 seconds geometric mean

Profiling, but not memory:

export DD_PROFILING_ENABLED=true
export DD_PROFILING_STACK_V2_ENABLED=true
export DD_PROFILING_MEMORY_ENABLED=false

[19, 19, 20, 15, 20] — 18.5 seconds geometric mean
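For anyone re-running these measurements, the geometric means above can be reproduced with a small helper. This is a sketch; the name geomean is made up here, and it assumes every sample is a positive number.

```shell
# Hypothetical helper: geometric mean of the positive numbers given as
# arguments, printed to one decimal place.
geomean() {
  printf '%s\n' "$@" | awk '{ sum += log($1) } END { if (NR) printf "%.1f\n", exp(sum / NR) }'
}
```

geomean 11 15 14 14 prints 13.4, geomean 27 32 31 28 prints 29.4, and geomean 19 19 20 15 20 prints 18.5, matching the figures reported above.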

robrap added this to Arch-BOM Jul 29, 2024

robrap converted this from a draft issue Jul 29, 2024

robrap moved this to Backlog in Arch-BOM Jul 29, 2024

robrap changed the title ~~Enable APM profiling for edxapp~~ [Post-NR] Enable APM profiling for edxapp Aug 21, 2024

robrap changed the title ~~[Post-NR] Enable APM profiling for edxapp~~ Enable APM profiling for edxapp Aug 21, 2024

jristau1984 moved this from Backlog to Ready For Development in Arch-BOM Oct 21, 2024

dianakhuang self-assigned this Oct 21, 2024

dianakhuang moved this from Ready For Development to In Progress in Arch-BOM Oct 21, 2024

dianakhuang added a commit to edx/configuration that referenced this issue Nov 12, 2024

feat: Use v2 of the Datadog Profiler stack.

4bae081

It seems like the newer version might be more efficient, so we should switch to using it. edx/edx-arch-experiments#749

dianakhuang mentioned this issue Nov 12, 2024

feat: Use v2 of the Datadog Profiler stack. edx/configuration#100

Merged

3 tasks

robrap added the waiting-on-other-team label Nov 20, 2024

robrap moved this from In Progress to Blocked in Arch-BOM Dec 2, 2024

robrap assigned timmc-edx Dec 2, 2024
