Skip to content

Commit

Permalink
Merge pull request #7 from openedx/whuang1202/event-bus-kafka
Browse files Browse the repository at this point in the history
Addresses most of #2 by copying code from https://github.com/edx/edx-arch-experiments/tree/7146b62063556b149cb9643a8e5f07d205bca5e9/edx_arch_experiments/kafka_consumer and doing some fixes, tweaks, lint removal, and testing.

Note specifically:

- Switches hardcoded signal to `SESSION_LOGIN_COMPLETED` for ease of testing
- Adds producer command as well, for manual testing
  • Loading branch information
timmc-edx authored Jul 26, 2022
2 parents 8e4bf26 + 3e05743 commit 2a4150e
Show file tree
Hide file tree
Showing 19 changed files with 937 additions and 88 deletions.
6 changes: 4 additions & 2 deletions README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -6,8 +6,8 @@ Kafka implementation for Open edX event bus.
|pypi-badge| |ci-badge| |codecov-badge| |doc-badge| |pyversions-badge|
|license-badge|

Overview (please modify)
------------------------
Overview
--------
This package implements an event bus for Open EdX using Kafka.

The event bus acts as a broker between services publishing events and other services that consume these events.
Expand All @@ -28,6 +28,8 @@ Outside of testing this app, it is best to leave the KAFKA_CONSUMERS_ENABLED set

The repository works together with the openedx/openedx-events repository to make the fully functional event bus.

For manual testing, see `<docs/how_tos/manual_testing.rst>`__.

Documentation
-------------

Expand Down
153 changes: 153 additions & 0 deletions docs/decisions/0002-kafka-based-event-bus.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,153 @@
2. Kafka-Based Event Bus
========================

Status
------

Provisional

Context
-------

The draft `OEP-52: Event Bus Architecture`_ explains how the Open edX platform would benefit from an event bus, as well as providing some additional decisions around the event bus. One decision is to enable the event bus technology to be pluggable through some abstraction layer.

This still requires selecting a specific technology for the first implementation.

.. _`OEP-52: Event Bus Architecture`: https://github.com/openedx/open-edx-proposals/pull/233

Decision
--------

An initial implementation of an event bus for the Open edX platform will be implemented using `Kafka`_. This implementation will be used by edx.org, and available to the Open edX community.

This decision does not preclude the introduction of alternative event bus implementations based on other technologies in the future.

.. _Kafka: https://kafka.apache.org/

Why Kafka?
~~~~~~~~~~

Kafka is a distributed streaming platform. Kafka's implementation maps nicely to the pub/sub pattern. However, some native features of a message broker are not built-in.

Kafka has been around for a long time. See `Thoughtworks's technology radar introduced Kafka`_ as "Assess" in 2015, and "Trial" in 2016. It never moved up to "Adopt", and also never moved down to "Hold". Read `Thoughtwork's Kafka decoder page`_ to learn more about its benefits and trade-offs, and how it is used.

More recently, the `Thoughtworks's technology radar introduced Apache Pulsar`_ as "assess" in 2020, and the `technology radar introduced Kafka API without Kafka`_ in 2021. This both demonstrates the de facto standard of the Kafka API, but also Thoughtwork's hope to find a less complex alternative.

We believe Apache Kafka is still the right option due to its maturity, documentation, support and community.

.. _Thoughtworks's technology radar introduced Kafka: https://www.thoughtworks.com/radar/tools/apache-kafka
.. _Thoughtwork's Kafka decoder page: https://www.thoughtworks.com/decoder/kafka

.. _Thoughtworks's technology radar introduced Apache Pulsar: https://www.thoughtworks.com/radar/platforms/apache-pulsar
.. _technology radar introduced Kafka API without Kafka: https://www.thoughtworks.com/radar/platforms/kafka-api-without-kafka

Kafka Highlights
~~~~~~~~~~~~~~~~

Pros
^^^^

* Battle-tested, widely adopted, big community, lots of documentation and answers.
* Enables event replay-ability.

Cons
^^^^

* Complex to manage, including likely manual scaling.
* Simple consumers require additional code for some messaging features.

Consequences
------------

* Operators will need to deploy and manage the selected infrastructure, which will likely be complex. If Apache Kafka is selected, there are likely to be a set of auxiliary parts to provide all required functionality for our message bus. However, third-party hosting is also available (see separate decision).
* Most of the consequences of an event bus should relate to `OEP-52: Event Bus Architecture`_ more generally, and hopefully will not be Kafka specific.

Rejected Alternatives
---------------------

Apache Pulsar
~~~~~~~~~~~~~

Although rejected for initial edx.org implementation, `Apache Pulsar`_ remains an option for those looking for an alternative to Kafka.

Pros
^^^^

* Ease of scalability (built-in, according to docs).
* Good data retention capabilities.
* Additional built-in pub/sub features (built-in, according to docs).

Cons
^^^^

* Requires 3rd party hosting or larger upfront investment if self-hosted (kubernetes).
* Less mature (but growing) community, little documentation, and few answers.
* Python built-in schema management is buggy and hard to work with for complex use cases.

Note: Read an interesting (Kafka/Confluent) biased article exploring `comparisons and myths of Kafka vs Pulsar`_.

.. _Apache Pulsar: https://pulsar.apache.org/
.. _comparisons and myths of Kafka vs Pulsar: https://dzone.com/articles/pulsar-vs-kafka-comparison-and-myths-explored

Redis
~~~~~

Pros
^^^^

* Already part of the Open edX platform.

Cons
^^^^

* Can lose acked data, even if RAM backed up with an append-only file (AOF).
* Requires homegrown schema management.

RabbitMQ
~~~~~~~~

Pros
^^^^

* Built-in message broker capabilities like routing, filtering, and fault handling.

Cons
^^^^

* Not built for message retention or message ordering.

AWS SNS/SQS
~~~~~~~~~~~

Pros
^^^^

* Simpler hosting for those self-hosting in AWS.

Cons
^^^^

* Cannot be shared as an open source solution.
* Events are not replayable.

Additional References
---------------------

* Technology comparisons performed by edX.org:

* `Message Bus Rubric Definition <https://docs.google.com/document/d/1lKbOE8HkUk__Cyy5u_yFZ8ju0roPtlxcH1-9yf9hX8I/edit#>`__

* `Message Bus Evaluation <https://docs.google.com/spreadsheets/d/1pA08DQ1h3bov5fL1KTrT0tk2RJseyxPsZCLJACtb3YY/edit#gid=0>`__

* `Pulsar vs Kafka Hosting Comparison <https://openedx.atlassian.net/wiki/spaces/SRE/pages/3079733386>`__

* Third-party comparisons of Kafka vs Pulsar:

* `(Kafka biased) Benchmarking comparison <https://www.confluent.io/blog/kafka-fastest-messaging-system/>`__
* `(Pulsar biased) Performance, Architecture, and Features comparison - Part 1 <https://streamnative.io/en/blog/tech/2020-07-08-pulsar-vs-kafka-part-1/>`__
* `(Pulsar biased) Performance, Architecture, and Features comparison - Part 2 <https://streamnative.io/en/blog/tech/2020-07-22-pulsar-vs-kafka-part-2/>`__
* `(Kafka biased) Twitter's move from Pulsar-like to Kafka <https://blog.twitter.com/engineering/en_us/topics/insights/2018/twitters-kafka-adoption-story>`__

* Third-party comparisons of Kafka vs RabbitMQ:

* `Blog article comparing Kafka and RabbitMQ <https://stiller.blog/2020/02/rabbitmq-vs-kafka-an-architects-dilemma-part-2/>`__
48 changes: 48 additions & 0 deletions docs/decisions/0003-managing-kafka-consumers.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
3. Managing Kafka Consumers
===========================

Status
------
Provisional

Context
-------
As outlined in the ADR on a Kafka-based Event Bus, edX.org has elected to go with Apache Kafka as our event bus implementation. Though the decision presented here is predicated on this particular edX.org decision, it is included to help other Open edX users evaluate Kafka for their own purposes. The standard pattern for consuming events with Kafka is to poll in a loop and process messages as they come in. According to the Confluent team it is a best practice to limit each consumer to a single topic (Confluent is a platform for industry-scale Kafka management)::

consumer.subscribe(["topic"])
while True:
message = consumer.poll()
## process message

This ``while True`` loop means whatever is running this consumer will run infinitely and block whatever thread runs it from doing anything else. Thus, this code cannot be run as part of the regular Django web server. It also would not fit neatly onto a celery task, which would put it in direct competition for workers with all other celery tasks and be difficult to scale as the number of topics increases.

Decision
--------
edX.org will use Kubernetes to manage containers whose sole purpose is to run a management command, which in turn will run a polling loop against the specified topic. This will enable standard horizontal scaling of Kafka consumer groups.

The loop will listen for new events and then kick off the processing code for each event.

The new consumer containers will share access to the database and the same codebase as the backend service it supports. These would all be considered part of the same bounded context (from Domain-Driven Design).

Consequences
------------

* If the new consumer is supporting a backend service that is not yet deployed using Kubernetes, for example, as an AWS instance, care must be taken to ensure consumers are deployed with the rest of the app. Considerations need to be made if one part of this deployment breaks, but not another.

Rejected Alternatives
---------------------

#. Use a recurring/scheduled celery task to consume and process Kafka events

* Celery has several disadvantages, including the difficulty of managing priorities for a queue, that it would be nice to avoid.
* Horizontal scaling of consumer groups would not work correctly using our existing group of celery workers.
* A fixed schedule for processing events would run into issues if the time to process events ever gets longer than the schedule.
* Note: this may be used as a temporary solution for the purpose of iterating, but not as a long-term production solution.

#. Create a new ASG of EC2 instances dedicated to running a consumer management command, similar to how we create instances dedicated to running celery workers

* edX and the industry in general we are moving away from the ASG pattern and on to Kubernetes. Both the ASG approach and the Kubernetes approach would require a substantial amount of work in order to make the number of instances scalable based on number of topics rather than built-in measurements like CPU load. Based on this, it makes more sense to put in the effort in Kubernetes rather than creating more outdated infrastructure.

#. Django-channels

* Research turned up the possibility of using django-channels (websocket equivalent for Django) for use with Kafka, but the design and potential benefit was unclear so this was not pursued further
47 changes: 47 additions & 0 deletions docs/decisions/0004-kafka-managed-hosting.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
4. Kafka Managed Hosting
========================

Status
------

Provisional

Context
-------

* Setting up a `Kafka`_-based event bus can be complex since it comes in multiple parts that all need to be deployed
and maintained together.

* The 2U/edX team responsible for this new infrastructure does not have the capacity to manage this in house.

.. _Kafka: https://kafka.apache.org/

Decision
--------

.. note::

This decision is specific to edx.org and is not a requirement or recommendation for how other deployments in the Open edX community manage their brokers. However, the background may still be useful to other community members.

For edx.org, the initial deployment of `Kafka`_-based event bus will use the software as a servics (SaaS) provider `Confluent`_.

.. _Confluent: https://www.confluent.io/

Additional Background
---------------------

`Amazon MSK`_ is an AWS managed service that supplies the Apache Kafka core platform only.

The `Confluent Platform`_ adds additional capabilities, some of which are only commercially available.

* `Schema Registry <https://www.confluent.io/product/confluent-platform/data-compatibility/>`__
* Monitoring and alerting capabilities (Commercial)
* Self-balancing clusters (Commercial)
* Tiered storage (Commercial) (future feature of Apache Kafka)
* Infinite retention (Commercial - Cloud only)

Also see a useful and biased `comparison of Apache Kafka vs Vendors`_ by Kai Waehner (of Confluent), comparing various providers and distributions of Kafka and related or competitive services. Or see `(Confluent biased) Amazon MSK vs Confluent Cloud <https://www.confluent.io/confluent-cloud-vs-amazon-msk>`__.

.. _Amazon MSK: https://aws.amazon.com/msk/
.. _Confluent Platform: https://www.confluent.io/product/confluent-platform
.. _comparison of Apache Kafka vs Vendors: https://www.kai-waehner.de/blog/2021/04/20/comparison-open-source-apache-kafka-vs-confluent-cloudera-red-hat-amazon-msk-cloud/
56 changes: 26 additions & 30 deletions docs/how_tos/manual_testing.rst
Original file line number Diff line number Diff line change
Expand Up @@ -3,36 +3,32 @@ Manual testing

The producer can be tested manually against a Kafka running in devstack.

#. Create a "unit test" in one of the test files that will actually call Kafka. For example, this could be added to the end of ``edx_event_bus_kafka/publishing/test_event_producer.py``::

def test_actually_send_to_event_bus():
import random
signal = openedx_events.learning.signals.SESSION_LOGIN_COMPLETED
# Make events distinguishable
id = random.randrange(1000)
event_data = {
'user': UserData(
id=id,
is_active=True,
pii=UserPersonalData(
username=f'foobob_{id:03}',
email='[email protected]',
name="Bob Foo",
)
)
}

print(f"Sending event with random user ID {id}.")
with override_settings(
SCHEMA_REGISTRY_URL='http://edx.devstack.schema-registry:8081',
KAFKA_BOOTSTRAP_SERVERS='edx.devstack.kafka:29092',
):
ep.send_to_event_bus(signal, 'user_stuff', 'user.id', event_data)

#. Make or refresh a copy of this repo where it can be seen from inside devstack: ``rsync -sax -delete ./ ../src/event-bus-kafka/``
#. In devstack, start Kafka and the control webapp: ``make dev.up.kafka-control-center`` and watch ``make dev.logs.kafka-control-center`` until server is up and happy (may take a few minutes; watch for ``INFO Kafka startTimeMs``)
#. Load the control center UI: http://localhost:9021/clusters and wait for the cluster to become healthy
#. In devstack, run ``lms-up-without-deps-shell`` to bring up an arbitrary shell inside Docker networking (LMS, in this case)
#. In the LMS shell, run ``pip install -e /edx/src/event-bus-kafka`` and then run whatever test you want, e.g. ``pytest /edx/src/event-bus-kafka/edx_event_bus_kafka/publishing/test_event_producer.py::test_actually_send_to_event_bus``
#. Go to the topic that was created and then into the Messages tab; select offset=0 to make sure you can see messages that were sent before you had the UI open.
#. Rerun ``rsync`` after any edits
#. In edx-platform's ``cms/envs/common.py``:

- Add ``'edx_event_bus_kafka'`` to the ``INSTALLED_APPS`` list
- Add the following::

KAFKA_CONSUMERS_ENABLED = True
KAFKA_BOOTSTRAP_SERVERS = "edx.devstack.kafka:29092"
SCHEMA_REGISTRY_URL = "http://edx.devstack.schema-registry:8081"

#. In devstack, run ``make devpi-up studio-up-without-deps-shell`` to bring up Studio with a shell.
#. In the Studio shell, run ``pip install -e /edx/src/event-bus-kafka``
#. Test the producer:

- Run the example command listed in the ``edx_event_bus_kafka.management.commands.produce_event.Command`` docstring
- Expect to see output that ends with a line containing "Event delivered to topic"
- Go to the topic that was created and then into the Messages tab; select offset=0 to make sure you can see messages that were sent before you had the UI open.

#. Test the consumer:

- Run the example command listed in the ``edx_event_bus_kafka.consumer.event_consumer.ConsumeEventsCommand`` docstring
- Expect to see output that ends with a line containing "Received SESSION_LOGIN_COMPLETED signal with user_data"
- Kill the management command (which would run indefinitely).

#. Rerun rsync after any edits as needed.

(Any IDA should work for testing, but all interactions have to happen inside devstack's networking layer, otherwise Kafka can't talk to itself.)
10 changes: 10 additions & 0 deletions edx_event_bus_kafka/consumer/README.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
========
Consumer
========

Purpose
-------

This part of the library implements event bus consumer patterns using the Confluent Kafka API.

During development, this app will be subject to frequent and rapid changes. Outside of testing this app, it is best to leave the KAFKA_CONSUMERS_ENABLED setting off.
3 changes: 3 additions & 0 deletions edx_event_bus_kafka/consumer/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
"""
Kafka consumer application.
"""
Loading

0 comments on commit 2a4150e

Please sign in to comment.