CockroachDB Storage Backend Issues with Auth Server Restart #49365

Open
TeleLos opened this issue Nov 22, 2024 · 0 comments
Labels
bug c-vdc Internal Customer Reference


TeleLos commented Nov 22, 2024

Expected behavior:
Self-hosted Teleport clusters configured with the CockroachDB storage backend should restart without errors.

Current behavior:
A customer deployed Teleport across three regions (or data centers) using CockroachDB as the storage backend. When restarting the Auth Service in the Secondary or Tertiary region, they encounter the following timeout error as the Auth Service attempts to connect to the CockroachDB backend.

Note that this issue does not occur in the Primary region of the Teleport cluster. In the Secondary and Tertiary regions, the error eventually resolves after multiple restart attempts, once the Auth Service receives a response from CockroachDB within 5 seconds.

Bug details:

  • Teleport version
    Issue observed with v16.4.6

  • Recreation steps
    The error is observed when simply restarting the Auth servers.

  • Auth Server Config

storage:
   type: cockroachdb
   conn_string: 'postgresql://[email protected]:26257/teleport?sslmode=verify-full&pool_max_conns=20'
   audit_events_uri: 'postgresql://[email protected]:26257/teleport_audit?sslmode=verify-full'
   audit_sessions_uri: "s3://<s3bucket>/recordqa?endpoint=https://s3.example.com:9021&insecure=false&disablesse=true&region=us-east"
  • Debug logs
Nov 21 14:16:19 host.example.com teleport[3318557]: 2024-11-21T14:16:19Z DEBU [PROC:1] Initializing auth backend. pid:3318557.1 backend:cockroachdb service/service.go:5947
Nov 21 14:16:19 host.example.com teleport[3318557]: 2024-11-21T14:16:19Z INFO [CRDB] Setting up backend. crdb/crdb.go:92
Nov 21 14:16:19 host.example.com teleport[3318557]: 2024-11-21T14:16:19Z DEBU [CRDB] Error creating database due to insufficient privileges. error:[ERROR: permission denied to create database (SQLSTATE 42501)] common/utils.go:84
Nov 21 14:16:20 host.example.com teleport[3318557]: 2024-11-21T14:16:20Z INFO [CRDB] Starting change feed stream. crdb/changefeed.go:36
Nov 21 14:16:20 host.example.com teleport[3318557]: 2024-11-21T14:16:20Z DEBU [BUFFER] Add Watcher(name=external_audit_storage, prefixes=[/external_audit_storage/cluster], capacity=1024, size=0). backend/buffer.go:278
Nov 21 14:16:20 host.example.com teleport[3318557]: 2024-11-21T14:16:20Z WARN [CRDB] Failed to configure cluster settings kv.rangefeed.enabled = true; crdb/changefeed.go:153
Nov 21 14:16:24 host.example.com teleport[3318557]: 2024-11-21T14:16:24Z INFO [S3] Setting up bucket "s3bucketname", sessions path "/recordqa" in region "us-east". s3sessions/s3handler.go:226
Nov 21 14:16:24 host.example.com teleport[3318557]: 2024-11-21T14:16:24Z INFO [S3] Setup bucket "s3bucketname" completed. duration:452.991745ms s3sessions/s3handler.go:230
Nov 21 14:16:24 host.example.com teleport[3318557]: 2024-11-21T14:16:24Z INFO [PGEVENTS] Setting up events backend. pgevents/pgevents.go:214
Nov 21 14:16:25 host.example.com teleport[3318557]: 2024-11-21T14:16:25Z DEBU [PGEVENTS] Error creating database due to insufficient privileges. error:[ERROR: permission denied to create database (SQLSTATE 42501)] common/utils.go:84
Nov 21 14:16:26 host.example.com teleport[3318557]: 2024-11-21T14:16:26Z DEBU [PGEVENTS] CockroachDB detected. pgevents/pgevents.go:296
Nov 21 14:16:26 host.example.com teleport[3318557]: 2024-11-21T14:16:26Z DEBU [PGEVENTS] Configuring CockroachDB native row expiry pgevents/pgevents.go:280
Nov 21 14:16:31 host.example.com teleport[3318557]: ERROR REPORT:
Nov 21 14:16:31 host.example.com teleport[3318557]: Original Error: *pgconn.errTimeout timeout: context deadline exceeded
Nov 21 14:16:31 host.example.com teleport[3318557]: Stack Trace:
Nov 21 14:16:31 host.example.com teleport[3318557]: github.com/gravitational/teleport/lib/events/pgevents/pgevents.go:284 github.com/gravitational/teleport/lib/events/pgevents.configureCockroachDBRetention
Nov 21 14:16:31 host.example.com teleport[3318557]: github.com/gravitational/teleport/lib/events/pgevents/pgevents.go:248 github.com/gravitational/teleport/lib/events/pgevents.New
Nov 21 14:16:31 host.example.com teleport[3318557]: github.com/gravitational/teleport/lib/service/service.go:1661 github.com/gravitational/teleport/lib/service.(*TeleportProcess).initAuthExternalAuditLog
Nov 21 14:16:31 host.example.com teleport[3318557]: github.com/gravitational/teleport/lib/service/service.go:1860 github.com/gravitational/teleport/lib/service.(*TeleportProcess).initAuthService
Nov 21 14:16:31 host.example.com teleport[3318557]: github.com/gravitational/teleport/lib/service/service.go:1265 github.com/gravitational/teleport/lib/service.NewTeleport
Nov 21 14:16:31 host.example.com teleport[3318557]: github.com/gravitational/teleport/e/tool/teleport/process/process.go:59 github.com/gravitational/teleport/e/tool/teleport/process.NewTeleport
Nov 21 14:16:31 host.example.com teleport[3318557]: github.com/gravitational/teleport/lib/service/service.go:753 github.com/gravitational/teleport/lib/service.Run
Nov 21 14:16:31 host.example.com teleport[3318557]: github.com/gravitational/teleport/e/tool/teleport/main.go:28 main.main
Nov 21 14:16:31 host.example.com teleport[3318557]: runtime/proc.go:271 runtime.main
Nov 21 14:16:31 host.example.com teleport[3318557]: runtime/asm_amd64.s:1695 runtime.goexit
Nov 21 14:16:31 host.example.com teleport[3318557]: User Message: initialization failed
Nov 21 14:16:31 host.example.com teleport[3318557]: configuring CockroachDB retention
Nov 21 14:16:31 host.example.com teleport[3318557]: timeout: context deadline exceeded
Nov 21 14:16:31 host.example.com systemd[1]: teleport.service: Main process exited, code=exited, status=1/FAILURE
Nov 21 14:16:31 host.example.com systemd[1]: teleport.service: Failed with result 'exit-code'.
Nov 21 14:16:31 host.example.com systemd[1]: teleport.service: Consumed 6.854s CPU time.
Nov 21 14:16:31 host.example.com systemd[1]: teleport.service: Scheduled restart job, restart counter is at 2.
Nov 21 14:16:31 host.example.com systemd[1]: Stopped Teleport Service.
@TeleLos TeleLos added bug c-vdc Internal Customer Reference labels Nov 22, 2024