Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] V24.3.x rebase #142

Open
wants to merge 367 commits into
base: v24.2.x
Choose a base branch
from
Open

Conversation

ballard26
Copy link

@ballard26 ballard26 commented Sep 11, 2024

Note that this PR is only open for the purpose of letting folks review the rebase before I push it to a dedicated v24.3.x branch. It will not be merged into the v24.2.x branch.

xemul and others added 30 commits June 3, 2024 20:59
There are several options what invoke_on_...() can try to invoke. So far
only non-const and non-mutable callables are in use and it works.

If calling it with mutable lambdas, it used to fail compilation, but
previous patch fixed it, so here's the test.

It also makes sense to support calling it on const sharded<> object
(with the lambda accepting const service reference), but it's not that
simple (see scylladb#2278)

Signed-off-by: Pavel Emelyanov <[email protected]>
The method actualy shuts the socket down the hard way. There will appear
a less disruptive call, so the existing one deserves some better name.

Signed-off-by: Pavel Emelyanov <[email protected]>
When a socket is closed after noticing the EOF message in its shared
buffer, it doesn't resolve the "shutdown" future. This is not very nice,
because socket is indeed dead, and no other code would resolve it, so
it's good to do it on close().

In fact, it is the peer who has to resolve this future, but it should
happen only after the queue with messages is drained and it complicates
the logic. Resolving it on close is good enough.

In the non-EOF case the socket is aborted, which also resolve the
aforementioned future.

Signed-off-by: Pavel Emelyanov <[email protected]>
There's such a wrapper around loopback connection factory that fits it
to http client needs. Several tests copy-and-pase it, new tests would
appear soon, it's time to generalize it.

Signed-off-by: Pavel Emelyanov <[email protected]>
It was added by 15942b9 (http/client: Count and limit the number of
connections) but wasn't documented, so do it now.

Signed-off-by: Pavel Emelyanov <[email protected]>
Currently consuming http response doesn't check if parser failed reading
the response check. Add it.

Also, when parser sees correct, but incomplete response it sets itself to
.eof(), not .failed(). This case should be told from invalid, so the
system error with ECONABORTED status is thrown instead.

Signed-off-by: Pavel Emelyanov <[email protected]>
This needs splitting connection::get_connection() into get_ and
make_connection. Both to facilitate next patching

Signed-off-by: Pavel Emelyanov <[email protected]>
Currently request is passed by value all the way down to connection
where it's moved to stable memory with do_with(std::move(req), ...) and
then processing continues.

Next patches are going to teach client restart requests over different
connections, so the request should be moved into do_with-context at the
client side.

While at it -- keep the response handler in the stable context as well,
next patching will appreciate this change.

Signed-off-by: Pavel Emelyanov <[email protected]>
Currently if client::make_request() fails, the error is propagated back
to the caller whatever the error is. Sometimes the request fails because
of connection glitch, which can happen and in that case it's common to
try the same request again after a while in a hope that it was indeed
some teporary glitch.

Signed-off-by: Pavel Emelyanov <[email protected]>
…ue' from Patryk Wróbel

By default io_tester uses the size of a file
as the allocation extent size hint. However,
certain tests may want to specify some exact
value for that parameter e.g. to match the
extent size hint configured for a specific product.

This change adds a new parameter to YAML job
description that enables users to explicitly
set some specific value for extent size hint.

Closes scylladb#2280

* https://github.com/scylladb/seastar:
  io-tester.md: update available parameters for job description
  io_tester: expose extent_allocation_size_hint via job param
c-ares marked some APIs deprecated in 1.28.1.

in this change, we conditionally use the undeprecated APIs when
they are available. please note, we don't specify the minimal
supported c-ares version in our building system.

in which, ares_fds() and ares_process() are not changed yet, because
we need to change the way how to poll the events for name resolution.
this would need more thoughts before moving forward.

Refs scylladb#2197
Signed-off-by: Kefu Chai <[email protected]>

Closes scylladb#2200
reduce image size
reduce layers amount
format
combine apt-get update with apt-get install in the same RUN statement
use arguments JSON notation for CMD: https://github.com/hadolint/hadolint/wiki/DL3025\#rationale
it reduces the layer's size. also we do not need this file in running container
it makes maintenance easier and helps to avoid duplication of packages and make the list much easier to update
There's an issue with maintaining keep-alive sockets by http client.

Some prior art first.

The HTTP client uses seastar connections to send requests over and receive responses back. Typically the connection is a TCP socket (or TLS connection which is a wrapper over TCP socket too). Once request-response cycle is over the connection used for that is kept by a client in a "pool". Making another http request may then pick up the existing connection from the pool thus avoiding the extra latency of establishing new connection. Pool may thus accumulate more than one connection if the client user sends several requests in parallel.

```mermaid
flowchart TD
  A[ Start ] --> B{ pool is not empty }
  B -->| y | C[ pop connection from pool ]
  B -->| n | D[ establish new connection ]
  C --> E[ talk to server ]
  D --> E
  E --> F{ got error? }
  F --> | n | G[ push connection to pool ]
  F --> | y | H[ Done ]
  G --> H
```

HTTP servers may sometimes want to terminate the connections it keeps. This can happen in one of several ways.

The "gentle" way is when server adds the "connection: close" header to its response. In that case client would handle the response and will not put the connection back to pool, but instead would just close it. So next request would either pick some other connection from pool or would need to establish a new one.

Less gentle way a server may terminate a connection is by closing it, so the underlying TCP stack would communicate regular TCP FIN-s. On the client side there's connected_socket::wait_input_shutdown() method that returns a future that gets resolved when kernel terminates the connection by peers request. In case client's connection is kept in pool it will be closed and removed from the pool behind the scenes. Next request won't even notice that -- it will either pick some other connection from pool, or will establish a new one.

Sometimes more unpleasant situation happens. It can be either a race or deliberate server's "abort". In the former case, server closes the connection and TCP starts communicating FIN-s in parallel with client picking up a connection for its new request. In that case, even if kernel resolves "input-shutdown" event described above, due to seastar event loop and batch flusher, client would see that the connection is closed _after_ it had picked it from pool and had chance to put some data into it. In the latter case server closes the connection in the middle of reading the request from the client or even in the middle of writing back the response. This is very unlikely, but still happens from time to time.

Having said the above, the problem -- when user calls `client::make_request()` and client steps on the "unpleasant" server-side connection closure, the request making resolves with exception and user has to do something about it. There are two options here -- handle the exception somehow or ask client to make the same request again (restart the request). This PR suggest that the latter choice can be implemented in the HTTP client code. If user doesn't want to restart, it may ask client not to do it, the new API allows for that.

First (a side node) -- to tell the "server closed its socket" from other errors the exception from scoket IO is checked to be the system error with EPIPE or ECONABORTED code in it. EPIPE is reported when it comes from writing the request, ECONABORTED is reported when it comes from reading the response.

Next, there's only one way to replay the request -- client should get new connection somehow and repeat the request-response cycle. There are two options where to get new connection from -- from pool (if it's there) or establish a new one. While experimenting with HTTP client I noticed that picking up connection from pool often results in several "transport exception"-s in a row, as if server was closing connections in batches. So this PR makes a shortcut for restarts it always establishes new connection to server (which, after request-response cycle, is put back into pool normally).

```mermaid
flowchart TD
  A[ Start ] --> B{ pool is not empty }
  B -->| y | C[ pop connection from pool ]
  B -->| n | D[ establish new connection ]
  C --> E[ talk to server ]
  D --> E
  E --> F{ got error? }
  F --> | n | G[ push connection to pool ]
  F --> | y | J{ should restart }
  J --> | n | H[ Done ]
  J --> | y | D
  G --> H
```

And the last, but not least, restart only happens once, because the chance of observing "transport exception" from newly established connection is considered to be low enough not to care. Respectively, if a request is made over new connection (not over a connection from pool) and "transport exception" pops up it's not restarted.

Closes scylladb#1883

* github.com:scylladb/seastar:
  http/client: Retry request over fresh connection in case old one failed
  http/client: Fix indentation after previous patch
  http/client: Pass request and handle by reference
  http/client: Introduce make_new_connection()
  http/client: Fix parser result checking
  http/client: Document max_connections
  test/http: Generalize http connection factory
  loopback_socket: Shutdown socket on EOF close
  loopback_socket: Rename buffer's shutdown() to abort()
for better readability

Signed-off-by: Kefu Chai <[email protected]>
`count` is never used after being assigned, so drop it.

Signed-off-by: Kefu Chai <[email protected]>
lint dockerfile ([linter](https://hadolint.github.io/hadolint/)):
- reduce image size
- reduce layers amount
- prevent caching issues related to apt-get commands usage
- use arguments JSON notation for CMD: [rationale](https://github.com/hadolint/hadolint/wiki/DL3025#rationale)

sort packages alphanumerically to make maintenance easier

Closes scylladb#2283

* https://github.com/scylladb/seastar:
  scripts: sort packages alphanumerically
  docker: bind the file instead of copying during the build stage
  docker: lint dockerfile
…fu Chai

in this series, we move all tests in `test_metrics.py` into `prometheus_test.py ` to drop the repeating code.

Closes scylladb#2254

* https://github.com/scylladb/seastar:
  tests: test protobuf support in prometheus_test.py
  tests: enable prometheus_test.py to test metrics without aggregation
in 6de08b0, we introduced the dependency to pyyaml, but we failed
to update `install-dependencies.sh` to install it.

in this change, we add this dependency to `install-dependencies.sh`.

Refs 6de08b0
Signed-off-by: Kefu Chai <[email protected]>
before this change, we do not verify that `stream` produces the
consumed values.

after this change, we check if the subscriber of `stream` gets the
value sent by the writer side.

Signed-off-by: Kefu Chai <[email protected]>
`write(output_stream<char>&, const jsonable&)` would call
`jsonable::to_json` which would just serialize the whole object (and its
children) to a string and then dump that onto the `output_stream`.

Instead call `jsonable::write(output_stream<char>&)` which will
recursively call `write` (for the `jsonbase` subclasses)  and hence
propagate the benefits of writing to the output stream directly.

To make sure that the object is alive for the duration of the call we
change the overload to copy its argument.
This is to avoid potential lifetime issues as these overloads might
yield internally and hence the input needs to be kept alive for the
duration of that.
Adds a basic json formatter test for jsonable objects and one for
streaming them via `stream_range_as_array`. The latter is also a
reproducer for the stack-use-after-return.
I was actually surprised to see that `then()` no longer does a preemption check (and this certainly leads to better code generation!).

I checked the tutorial and notice that it refers a different mechanism entirely, a 256-continuation counter but I can't find any trace of this in the code, so I assume it preceeded the time-based preemption check which has subsequently been removed. 

So:

 - Rip out mention of the 256-counter
 - Talk about preemption which isn't really covered in the tutorial and add that then() is not a suspension point (but co_await is, by default)
… Travis Downs

I was actually surprised to see that `then()` no longer does a preemption check (and this certainly leads to better code generation!).

I checked the tutorial and notice that it refers a different mechanism entirely, a 256-continuation counter but I can't find any trace of this in the code, so I assume it preceeded the time-based preemption check which has subsequently been removed.

So:

 - Rip out mention of the 256-counter
 - Talk about preemption which isn't really covered in the tutorial and add that then() is not a suspension point (but co_await is, by default)

Closes scylladb#2292

* github.com:scylladb/seastar:
  tutorial.md: fix typos
  Update tutorial.md to reflect update preemption methods
  tutorial.md: remove trailing whitespace
michael-redpanda and others added 12 commits September 10, 2024 23:54
GnuTLS permitted a user to provide a chain of certificates in the
certificate file passed to set_x509_key_file rather than
set_x509_trust_file.  This commit adjusts how the OpenSSL implementation
parses the provided certificate file so it can also accept a chain of
certs in the certificate file parameter to set_x509_key_file.

Signed-off-by: Michael Boquard <[email protected]>
Once the shutdown alert is sent, if anymore data is received on the SSL
session, the SSL_shutdown method will return
SSL_R_APPLICATION_DATA_AFTER_CLOSE_NOTIFY error code.  This may be due
to a race condition where data is received immediately after the close
notify is sent.  In this situation, we will simply ignore the error and
continue shutdown.

Added the drain_openssl_error_stack method that will purposefully empty
out OpenSSL's error stack in situations where errors can be safely
ignored.

Signed-off-by: Michael Boquard <[email protected]>
The prometheus text protocol requires that the label value to be
escaped. Note that this is only needed for the test protocol and not for
the protobuf-based prometheus protocol.

> label_value can be any sequence of UTF-8 characters, but the backslash
> (\), double-quote ("), and line feed (\n) characters have to be
> escaped as \\, \", and \n, respectively.

From https://github.com/prometheus/docs/blob/main/content/docs/instrumenting/exposition_formats.md
Avoid a string copy when writing the metric value and the metric type to
the prometheus text-based response.
Adds a test to verify that the metrics returned with the prometheus text
format are formatted as expected.
Fixes up an earlier commit which should have been a refactor commit but
had an unintended change to the way `double`'s were formatted on the
prometheus response.

Note that now `fmt::print` is used instead of the originally used
`std::to_string`. This is because `std::to_string` is going to change
its formatting behaviour in C++26, so we prefer to use a more
deterministic formatter here.

Fixes up 864c209
Fixes scylladb#2154

TLS1.3 session tickets are a client-storage solution for session
resume, bypassing some of the more expensive parts of a normal
handshake.

Implemented by adding an optional session data parameter to
tls_options, options to certificates for server keys, and query
functions for state/session data on the client side.

Note that the feature requires both client and server to handle
tls1.3.

Server session keys are kept in the certificates object, and can
be rotated by simply re-enabling the feature on the object.

Reloadable certificates will rotate keys on cert reload.

Simple test included.

Redpanda note:

Chery-picked from upstream c224fe0.

Added enough code to make OpenSSL implementation build (test will fail).

Signed-off-by: Michael Boquard <[email protected]>
`future::get0()` was deprecated. so let's use `future::get()` instead.
this addresses following build failure:

```
FAILED: tests/unit/CMakeFiles/test_unit_tls.dir/tls_test.cc.o
/usr/bin/g++-13 -DBOOST_ALL_DYN_LINK -DBOOST_NO_CXX98_FUNCTION_BASE -DFMT_SHARED -DSEASTAR_API_LEVEL=7 -DSEASTAR_BUILD_SHARED_LIBS -DSEASTAR_DEBUG -DSEASTAR_DEBUG_PROMISE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_DEFERRED_ACTION_REQUIRE_NOEXCEPT -DSEASTAR_HAS_MEMBARRIER -DSEASTAR_HAVE_ASAN_FIBER_SUPPORT -DSEASTAR_HAVE_HWLOC -DSEASTAR_HAVE_NUMA -DSEASTAR_HAVE_SYSTEMTAP_SDT -DSEASTAR_HAVE_URING -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_SSTRING -DSEASTAR_TESTING_MAIN -DSEASTAR_TESTING_WITH_NETWORKING=1 -DSEASTAR_TYPE_ERASE_MORE -I/home/circleci/project/tests/unit -I/home/circleci/project/src -I/home/circleci/project/include -I/home/circleci/project/build/debug/gen/include -I/home/circleci/project/build/debug/gen/src -g -std=gnu++20 -U_FORTIFY_SOURCE -Wno-maybe-uninitialized -Wno-error=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -fno-sanitize=vptr -UNDEBUG -Wall -Werror -Wimplicit-fallthrough -Wdeprecated -Wno-error=deprecated -Wno-error=stringop-overflow -Wno-error=array-bounds -gz -MD -MT tests/unit/CMakeFiles/test_unit_tls.dir/tls_test.cc.o -MF tests/unit/CMakeFiles/test_unit_tls.dir/tls_test.cc.o.d -o tests/unit/CMakeFiles/test_unit_tls.dir/tls_test.cc.o -c /home/circleci/project/tests/unit/tls_test.cc
In file included from /usr/include/boost/test/test_tools.hpp:45,
                 from /usr/include/boost/test/unit_test.hpp:18,
                 from /home/circleci/project/include/seastar/testing/seastar_test.hh:27,
                 from /home/circleci/project/include/seastar/testing/test_case.hh:31,
                 from /home/circleci/project/tests/unit/tls_test.cc:40:
/home/circleci/project/tests/unit/tls_test.cc: In member function 'void test_tls13_session_tickets::do_run_test_case() const':
/home/circleci/project/tests/unit/tls_test.cc:1541:61: error: 'seastar::future<T>::get0_return_type seastar::future<T>::get0() [with T = bool; get0_return_type = bool]' is deprecated: Use get() instead [-Werror=deprecated-declarations]
 1541 |         BOOST_REQUIRE(!tls::check_session_is_resumed(c).get0()); // no resume data
      |                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
In file included from /home/circleci/project/include/seastar/core/do_with.hh:25,
                 from /home/circleci/project/tests/unit/tls_test.cc:25:
/home/circleci/project/include/seastar/core/future.hh:1376:22: note: declared here
 1376 |     get0_return_type get0() {
      |                      ^~~~
/home/circleci/project/tests/unit/tls_test.cc:1544:57: error: 'seastar::future<T>::get0_return_type seastar::future<T>::get0() [with T = std::vector<unsigned char>; get0_return_type = std::vector<unsigned char>]' is deprecated: Use get() instead [-Werror=deprecated-declarations]
 1544 |         sess_data = tls::get_session_resume_data(c).get0();
      |                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
/home/circleci/project/include/seastar/core/future.hh:1376:22: note: declared here
 1376 |     get0_return_type get0() {
      |                      ^~~~
/home/circleci/project/tests/unit/tls_test.cc:1584:29: error: 'seastar::future<T>::get0_return_type seastar::future<T>::get0() [with T = bool; get0_return_type = bool]' is deprecated: Use get() instead [-Werror=deprecated-declarations]
 1584 |         BOOST_REQUIRE(f.get0()); // Should work
      |                       ~~~~~~^~
/home/circleci/project/include/seastar/core/future.hh:1376:22: note: declared here
 1376 |     get0_return_type get0() {
      |                      ^~~~
cc1plus: all warnings being treated as errors
```

Refs c224fe0
Signed-off-by: Kefu Chai <[email protected]>
The tls session ticket contains sensitive data which needs to be cleared
before it is released with free.

This is documented in the man page of
`gnutls_session_ticket_enable_server`:
```
Request that the server should attempt session resumption using
session tickets, i.e., by delegating storage to the client.
 key must be initialized using
gnutls_session_ticket_key_generate().  To avoid leaking that key,
use gnutls_memset() prior to releasing it.
```

(cherry picked from commit 5e9b43a)
@ballard26
Copy link
Author

@michael-redpanda @pgellert If you folks could take a close look at the final TLS code and see if it's correct I'd greatly appreciate it. Also can one of you take a look at the TLS tests and see if the following failure is expected?

[brandonallard@fedora seastar]$ build/release/tests/unit/tls_test                                                                                                                             
Running 40 test cases...                       
INFO  2024-09-11 15:34:12,633 seastar - Reactor backend: linux-aio                                                                                                                            
INFO  2024-09-11 15:34:12,636 seastar - Perf-based stall detector creation failed (EACCESS), try setting /proc/sys/kernel/perf_event_paranoid to 1 or less to enable kernel backtraces: falling back to posix timer.                         
INFO  2024-09-11 15:34:12,636 cpu_profiler - Perf-based cpu profiler creation failed (EACCESS), try setting /proc/sys/kernel/perf_event_paranoid to 1 or less to enable kernel backtraces: falling back to posix timer.                      
random-seed=2246926787                         
Established connection                         
Killing server side                            
Got expected exception                         
unknown location(0): fatal error: in "test_simple_x509_client_server_client_auth_combined_cert_file": std::runtime_error: Unexpected EOF
/home/brandonallard/data/projects/seastar-rebase/seastar/src/testing/seastar_test.cc(43): last checkpoint                                                                                     
unknown location(0): fatal error: in "test_dn_name_handling": std::invalid_argument: No DH level set                                                                                          
/home/brandonallard/data/projects/seastar-rebase/seastar/src/testing/seastar_test.cc(43): last checkpoint                                                                                     
unknown location(0): fatal error: in "test_alt_names": std::invalid_argument: No DH level set                                                                                                 
/home/brandonallard/data/projects/seastar-rebase/seastar/src/testing/seastar_test.cc(43): last checkpoint                                                                                     
unknown location(0): fatal error: in "test_skip_wait_for_eof": std::invalid_argument: No DH level set                                                                                         
/home/brandonallard/data/projects/seastar-rebase/seastar/src/testing/seastar_test.cc(43): last checkpoint                                                                                     
unknown location(0): fatal error: in "test_tls13_session_tickets": std::invalid_argument: No DH level set                                                                                     
/home/brandonallard/data/projects/seastar-rebase/seastar/src/testing/seastar_test.cc(43): last checkpoint                                                                                     

*** 5 failures are detected in the test module "Master Test Suite"                                                                                                                            

@michael-redpanda
Copy link

@michael-redpanda @pgellert If you folks could take a close look at the final TLS code and see if it's correct I'd greatly appreciate it. Also can one of you take a look at the TLS tests and see if the following failure is expected?

[brandonallard@fedora seastar]$ build/release/tests/unit/tls_test                                                                                                                             
Running 40 test cases...                       
INFO  2024-09-11 15:34:12,633 seastar - Reactor backend: linux-aio                                                                                                                            
INFO  2024-09-11 15:34:12,636 seastar - Perf-based stall detector creation failed (EACCESS), try setting /proc/sys/kernel/perf_event_paranoid to 1 or less to enable kernel backtraces: falling back to posix timer.                         
INFO  2024-09-11 15:34:12,636 cpu_profiler - Perf-based cpu profiler creation failed (EACCESS), try setting /proc/sys/kernel/perf_event_paranoid to 1 or less to enable kernel backtraces: falling back to posix timer.                      
random-seed=2246926787                         
Established connection                         
Killing server side                            
Got expected exception                         
unknown location(0): fatal error: in "test_simple_x509_client_server_client_auth_combined_cert_file": std::runtime_error: Unexpected EOF
/home/brandonallard/data/projects/seastar-rebase/seastar/src/testing/seastar_test.cc(43): last checkpoint                                                                                     
unknown location(0): fatal error: in "test_dn_name_handling": std::invalid_argument: No DH level set                                                                                          
/home/brandonallard/data/projects/seastar-rebase/seastar/src/testing/seastar_test.cc(43): last checkpoint                                                                                     
unknown location(0): fatal error: in "test_alt_names": std::invalid_argument: No DH level set                                                                                                 
/home/brandonallard/data/projects/seastar-rebase/seastar/src/testing/seastar_test.cc(43): last checkpoint                                                                                     
unknown location(0): fatal error: in "test_skip_wait_for_eof": std::invalid_argument: No DH level set                                                                                         
/home/brandonallard/data/projects/seastar-rebase/seastar/src/testing/seastar_test.cc(43): last checkpoint                                                                                     
unknown location(0): fatal error: in "test_tls13_session_tickets": std::invalid_argument: No DH level set                                                                                     
/home/brandonallard/data/projects/seastar-rebase/seastar/src/testing/seastar_test.cc(43): last checkpoint                                                                                     

*** 5 failures are detected in the test module "Master Test Suite"                                                                                                                            

I figured out the test issue. In tls-impl.cc you'll need add:

#ifndef SEASTAR_WITH_TLS_OSSL
#include <gnutls/gnutls.h>
#endif

around where the #include <boost...> stuff is

@ballard26
Copy link
Author

@michael-redpanda Nice! It looks like the include is missing in the v24.2.x branch as well. Will this cause any TLS issues in use? If so, do we want to fix things in the v24.2.x branch first then cherry-pick the changes here?

@michael-redpanda
Copy link

@michael-redpanda Nice! It looks like the include is missing in the v24.2.x branch as well. Will this cause any TLS issues in use? If so, do we want to fix things in the v24.2.x branch first then cherry-pick the changes here?

It should be fine on our v24.2.x branch, this change should only effect the GnuTLS build not the OpenSSL build

@ballard26
Copy link
Author

Note that adding the header resulted in a different TLS test failure. @michael-redpanda and I decided to not get gnutls working for this branch as we don't use it in RP. @michael-redpanda has a separate effort to get gnutls working in a PR he'll be upstreaming.

After compiling seastar with openSSL the following unit test error occurred;

67/80 Test #67: Seastar.unit.tls ..............................***Failed   40.58 sec
[0/1] cd /home/brandonallard/data/projects/seastar-rebase/seastar/build/release && /usr/bin/cmake -E env ASAN_OPTIONS=disable_coredump=0:abort_on_error=1:detect_stack_use_after_return=1 UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1 BOOST_TEST_CATCH_SYSTEM_ERRORS=no /home/brandonallard/data/projects/seastar-rebase/seastar/build/release/tests/unit/tls_test -- -c 2
Running 40 test cases...
INFO  2024-09-20 14:18:05,568 seastar - Reactor backend: linux-aio
INFO  2024-09-20 14:18:05,568 seastar - Perf-based stall detector creation failed (EACCESS), try setting /proc/sys/kernel/perf_event_paranoid to 1 or less to enable kernel backtraces: falling back to posix timer.
INFO  2024-09-20 14:18:05,568 cpu_profiler - Perf-based cpu profiler creation failed (EACCESS), try setting /proc/sys/kernel/perf_event_paranoid to 1 or less to enable kernel backtraces: falling back to posix timer.
random-seed=4140734661
/home/brandonallard/data/projects/seastar-rebase/seastar/tests/unit/tls_test.cc(453): fatal error: in "test_x509_client_tls13_fail": Expected exception
Established connection
Killing server side
Got expected exception

*** 1 failure is detected in the test module "Master Test Suite"
FAILED: tests/unit/CMakeFiles/test_unit_tls_run /home/brandonallard/data/projects/seastar-rebase/seastar/build/release/tests/unit/CMakeFiles/test_unit_tls_run 
cd /home/brandonallard/data/projects/seastar-rebase/seastar/build/release && /usr/bin/cmake -E env ASAN_OPTIONS=disable_coredump=0:abort_on_error=1:detect_stack_use_after_return=1 UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1 BOOST_TEST_CATCH_SYSTEM_ERRORS=no /home/brandonallard/data/projects/seastar-rebase/seastar/build/release/tests/unit/tls_test -- -c 2
ninja: build stopped: subcommand failed.

@michael-redpanda is looking into this.

@ballard26
Copy link
Author

@michael-redpanda was able to figure out the issues and I've pushed a change with his recommended fixes.

The last unit test failure I'll be looking into is the following;

      Start 32: Seastar.unit.websocket                                                                                                                                                        
32/80 Test #32: Seastar.unit.websocket ........................***Failed    0.36 sec                                                                                                          
[0/1] cd /home/brandonallard/data/projects/seastar-rebase/seastar/build/release/tests/unit && /usr/bin/cmake -E env ASAN_OPTIONS=disable_coredump=0:abort_on_error=1:detect_stack_use_after_re
turn=1 UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1 BOOST_TEST_CATCH_SYSTEM_ERRORS=no /home/brandonallard/data/projects/seastar-rebase/seastar/build/release/tests/unit/websocket_test -- -c
 2                                                                                                                                                                                            
Running 2 test cases...                                                                                                                                                                       
INFO  2024-09-20 14:59:18,093 seastar - Reactor backend: linux-aio                                                                                                                            
INFO  2024-09-20 14:59:18,094 seastar - Perf-based stall detector creation failed (EACCESS), try setting /proc/sys/kernel/perf_event_paranoid to 1 or less to enable kernel backtraces: fallin
g back to posix timer.                                                                                                                                                                        
INFO  2024-09-20 14:59:18,094 cpu_profiler - Perf-based cpu profiler creation failed (EACCESS), try setting /proc/sys/kernel/perf_event_paranoid to 1 or less to enable kernel backtraces: fal
ling back to posix timer.                                                                                                                                                                     
random-seed=3497040184                                                                                                                                                                        
/home/brandonallard/data/projects/seastar-rebase/seastar/tests/unit/websocket_test.cc(77): fatal error: in "test_websocket_handshake": critical check websocket_accept == "s3pPLMBiTxaQ9kYGzzh
ZRbK+xOo=" has failed [ != s3pPLMBiTxaQ9kYGzzhZRbK+xOo=]                                                                                                                                      
f.size(): 0                                                                                                                                                                                   
f.size(): 4                                                                                                                                                                                   
/home/brandonallard/data/projects/seastar-rebase/seastar/tests/unit/websocket_test.cc(152): fatal error: in "test_websocket_handler_registration": critical check rs_frame == response_str has
 failed [TEST !=                                                                                                                                                                              
TE]                                                                                                                                                                                           
f.size(): 0                                                                                                                                                                                   
                                                                                                                                                                                              
*** 2 failures are detected in the test module "Master Test Suite"                                                                                                                            
FAILED: tests/unit/CMakeFiles/test_unit_websocket_run /home/brandonallard/data/projects/seastar-rebase/seastar/build/release/tests/unit/CMakeFiles/test_unit_websocket_run                    
cd /home/brandonallard/data/projects/seastar-rebase/seastar/build/release/tests/unit && /usr/bin/cmake -E env ASAN_OPTIONS=disable_coredump=0:abort_on_error=1:detect_stack_use_after_return=1
 UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1 BOOST_TEST_CATCH_SYSTEM_ERRORS=no /home/brandonallard/data/projects/seastar-rebase/seastar/build/release/tests/unit/websocket_test -- -c 2    
ninja: build stopped: subcommand failed.                                                                                                                                                      

@ballard26
Copy link
Author

The websocket failures were from issues with the openSSL SHA1 base64 encoder when compiling with GCC. Normally the sha s3pPLMBiTxaQ9kYGzzhZRbK+xOo= should be encoded as;

73 | 33 | 70 | 50 | 4c | 4d | 42 | 69 | 54 | 78 | 61 | 51 | 39 | 6b | 59 | 47 | 7a | 7a | 68 | 5a | 52 | 62 | 4b | 2b | 78 | 4f | 6f | 3d

however, when compiling with openssl and gcc we get;

73 | 33 | 70 | 50 | 4c | 4d | 42 | 69 | 54 | 78 | 61 | 51 | 39 | 6b | 59 | 47 | 7a | 7a | 68 | 5a | 52 | 62 | 4b | 2b | 78 | 4f | 6f | 3d | 0 | ffffffbe

with openssl and llvm we get the expected encoding though. So since this isn't an issue with llvm, which is the only compiler we use for Redpanda, I'll be ignoring the test failures in GCC and continuing the rebase testing with LLVM only.

@StephanDollberg
Copy link
Member

with the openSSL SHA1 base64 encoder when compiling with GCC

Can you explain? This looks like a UB/memory-safety issue to me. GCC and clang shouldn't shouldn't compute this differently. Are we passing some uninitialized buffer somewhere (in the code or test)?

I'll be ignoring the test failures in GCC and continuing the rebase testing with LLVM only

I am not sure we should ignore this given the severity of the issue. Also we want to upstream this anyway and upstream definitely cares about gcc.

@michael-redpanda
Copy link

michael-redpanda commented Sep 25, 2024

I am not sure we should ignore this given the severity of the issue. Also we want to upstream this anyway and upstream definitely cares about gcc.

This effort is just applying our current changes onto master for our use. Separate effort underway to upstream. Plus we don't use websockets (yet?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.