-
Notifications
You must be signed in to change notification settings - Fork 0
/
rss.xml
157 lines (155 loc) · 93.7 KB
/
rss.xml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<title>Artificial Unintelligence.</title>
<link>/</link>
<description>A blog about data science and engineering, machine learning, and distributed system engineering.</description>
<lastBuildDate>Fri, 19 Apr 2024 10:17:38 GMT</lastBuildDate>
<docs>http://blogs.law.harvard.edu/tech/rss</docs>
<generator>https://github.com/jpmonette/feed</generator>
<item>
<title><![CDATA[Welcome, please don't spill crumbs on the carpet.]]></title>
<link>/posts/2019-08-019/Welcome</link>
<guid>/posts/2019-08-019/Welcome</guid>
<content:encoded><![CDATA[<h1 id="ive-given-in-this-is-the-result">I’ve given in. This is the result.</h1><p>So people have urged me to do more self promotion for years. They assure me this will wow employers and build my 'personal brand’, so I’ve finally given in and started this dingy little monospaced beauty you see before you. Actually, the benefit from my end is mostly brushing up on my React.js navigation and especially css skills, which, as you can see, are a bit short on <span class="document_rainbow__8gDhs"><strong>wow</strong></span>.</p><p>Oh yeah, <strong>this might also help some of you who are looking to learn about data engineering, data science, devops, machine learning and whatever buzzword they think up next.</strong> But I can’t really guarantee anything there obviously, and if a data centre catches fire because you took some of my advice, I will not be held liable. <em>(But have you heard about how to parallelise mgcv models in R? Seriously, try dialing up 96 cores of non-linear modelling capacity, it’s sick.)</em></p><figure class="document_figure__1C2Aq"><img src="/static/media/fire.1b6157f0.gif" class="document_image__3offz" alt="Elmo starting fire."/><figcaption>It wasn't me.</figcaption></figure><h2 id="so-why-write-a-blog-genius">So why write a blog, genius?</h2><p>Once I got started thinking about content, I realised that there aren’t many blogs out there getting into the nitty gritty of this field. There’s plenty of vendor marketing content explaining how easy it is <em>(thanks guys, you’ve convinced management that every two month task can be achieved in a week, an expectation which means that the project usually fails)</em>, and there’s plenty of introductory material on how to download sklearn from Github.</p><p>But there’s a whole process that isn’t really talked about amidst the hype - of building platforms to get the data to and from systems, putting metadata around them, sourcing that data in real time (yes, not every process should be run in batch) and updating models and predictions to suit a real time system. Most glaringly, there isn’t much written about what to do when things inevitably go wrong, how to debug a distributed system, trace an error across a cluster, or generally think through how to tackle a problem. <strong>This is notable, because most of us only have jobs due to things going wrong - there’s a reason the machines haven’t taken over yet.</strong></p><p>I think that these are worthwhile things to write about. One thing that hit me in the JS community is that most of the blogs and content focus on problems, and attempts to implement functionality which failed. We don’t get this very often in the data science/engineering world. Now, if that was because the whole JS language is one big problem made bearable only via sucessive layers of questionably performant shims - cough, React - it wouldn’t be me who pointed that out…</p><p>So you should expect to see quite a volume of cynicism - hopefully delivered in variety and at high velocity. This blog is about the warts and all of distributed systems engineering and making them fly the real world.</p><p>In my personal experience, reading about solving problems teaches something more useful than simply seeing the Happy Path reproduced in blog form - you may as well just <code>git clone</code>. And if this blog scares the bejeezus out of you, more’s the better. If I can thin the herd that means more work for me right?! 
<em>(Also, I make a great senior data scientist due to my supportive mentoring of junior staff - ask anyone.)</em></p>]]></content:encoded>
</item>
<item>
<title><![CDATA[DS I - That's no app. It's a platform.]]></title>
<link>/posts/2019-08-19/Data-Engineering-Pt-1</link>
<guid>/posts/2019-08-19/Data-Engineering-Pt-1</guid>
<content:encoded><![CDATA[<h2 id="a-blog-series-on-data-science-and-engineering">A blog series on data science and engineering</h2><p>Maybe you heard it was the sexiest discipline of the 21st century? I tried to warn you, but the introductory post didn’t scare you off?</p><p>Welcome to the first post in a series on data engineering, data science, and how to make these things actually work. We won’t be writing any code in this edition, we’ll just be outlining the structure of what we’re going to build over the next few posts, and why. We’ll start by talking about this idea of a 'platform’, and what that might entail, then we’ll outline what components we might want on our platform.</p><p>We’ll then code it up (using Scala, Python, JS, whatever comes to hand really) over the following posts. I won’t expect familiarity with the nuances of every language, that’s part of the learning experience I’m aiming for. If I haven’t covered something sufficiently, get me on <a href="https://twitter.com/MilesGarnsey/">Twitter</a> and let me know.</p><p>Now, most blogs like this would start off by telling you to download Python, install Jupyter… then we’d go through a variety of motions, culminating in <strong>the building of a decision tree in scikit-learn, at which point your salary would increase two-fold, your beard would become thick and lustrous (ladies, this applies to you too), and you would develop a passion for obscure areas of mathematics.</strong> Here, have some <span><strong>wow</strong></span>.</p><figure class="document_figure__1UT6w"><img src="/static/media/beard.04aaeab6.gif" class="document_image__3-wOd" alt="A man with a detachable beard."/><figcaption>Knowledge transfer in data science.</figcaption></figure><h2 id="sorry">Sorry…</h2><p>I’m looking to do things a bit more rigorously here, this blog is about doing data science and engineering in the real world, and how to solve the issues that arise. If you obtain the advantage of a beard from reading this blog, it will simply be because you haven’t left your home or showered in a week while you try to debug some mistake I inadvertently included. While I’m sure that you want to hear about the latest Tensorflow model (probably so you can go and use the pre-trained version amirite? 😏) there are good reasons to talk about the platform first.</p><p>It often comes about that we build a platform without realising it. Most of the code we write in relation to a data science project actually has nothing to do with the specific task at hand.</p><h2 id="an-example-application">An example application</h2><p>For an initial example, let’s think about building a system which stores Tweets, while making them available (via an API) to other applications. This is the first bit of functionality we’ll build as a part of this series.</p><p><em>You: goes and downloads the Twitter app from the app store and shuts the browser tab.</em></p><p><strong>Hang on…</strong> My domain is data science and engineering (unsubstantiated rumours suggest I write a blog about it), so let’s add three NFRs to ensure I can contribute something at least slightly novel. Let’s demand that the system be scalable, near real time (which I think is kind of implicit by talking about a real time source anyway, but some may disagree), and offer high availability.</p><p><strong>So our above requirement sounds simple, but there are a few things that should tip us off to the fact that it isn’t. </strong>Firstly, we’ve outlined a need for horizontal scalability. 
That means that we need to be able to add and remove instances of the application without interruption of service. Secondly, we’ve outlined HA as a requirement - this means we always need sufficient instances to serve requests, and in turn monitoring, triggers and autoscaling to figure out how many instances that is. Finally, we’ve asked for storage, which neccesitates a measure of persistence. In other words, whatever we implement needs to be a horizontally autoscaling highly available distributed system - not straightforward, no matter how good you are at installing Pandas and Numpy.</p><h2 id="the-need-for-a-platform">The need for a platform</h2><p>The platform is going to basically give us two things, a generic data plane and a generic control plane, and will include -</p><ol><li>A <strong>Kafka cluster + Schema registry</strong> to enable scaling and durability of data written, manage all persistent storage concerns, as well as enable rapid failovers.</li><li><strong>Etcd</strong> to do configuration management.</li><li><strong>Kubernetes</strong> to do general cluster things and manage scaling.</li><li><strong>Fluentd</strong> for monitoring.</li></ol><div class="document_sidebarcontainer__1afjl"><p>Why are we doing it this way? Luckily, thanks to the joys of <a href="https://github.com/jamesknelson/create-react-blog">MDX and React.JS</a>, this blog has sidebars to deal with long and slightly sarcastic digressions on matters such as this. <span><strong>wow</strong></span>.</p><span class="document_sidebar__3NMxd"><h2> Why are we doing it this way?</h2><p> Rather than going into specifics on each component’s purpose, I could say that we’re implementing a Kappa Architecture via microservices. Because you’re a learned reader - or maybe just because you have access to Google on your phone - you’d probably understand that whatever we’re building thus addresses the requirements around scalability, availability, near-real-timeness and storage. Whether or not you understood these things and why they work, you’d probably ask no further questions - because I intoned the name of an ancient Greek letter.</p><p> But this blog isn’t about selling you a bridge, so there are a variety of reasons why I didn’t just say <em>mumble mumble, Kappa Architecture, stakeholder value… enhanced… mumble</em>; the key one being that I probably have one version of a a Kappa Architecture Via Microservices in my head, and you have a different version in yours.</p><p> The issue with all of these “architectures” is that they don’t cover a sufficient set of application functionality to warrant the term - I’d think of them as design patterns. Having done the data science/engineering/big data/whatever thing for a while now, I’ve developed the (probably less than novel) opinion that basically there are only four things in the application world - data storage, messaging/APIs, human factors (which is everything from the front-end through to enterprise culture, the project manager, the business analyst, or the project’s stakeholders), and computation. They all need to be covered for something to call itself an architecture, in my view. e.g. a microservices pattern probably says that our microservices application code runs in a docker container, but relatively little about what happens when it needs to communicate. 
One default assumption is that this will happen via REST, but it isn’t an essential feature, and isn’t always a best practice.</p><p> Sometimes spending the additional verbiage on a real explanation of a design can save a tonne of effort down the track.</p></span><p>So that’s the platform. As for the app? That’s almost the easy part - we’ll use various Kafka libraries and <a href="https://github.com/DanielaSfregola/twitter4s">twitter4s</a>. There are a few others we’ll consider, but they very much sit on the utility side of things.</p><h2 id="the-data-plane---about-kafka">The Data Plane - about Kafka</h2><p>Our data plane will rely mostly on <a href="https://kafka.apache.org/">Kafka</a>, which often advertises itself as some sort of data hub type product, almost as an alternative to a database, data lake, or (at the other end) an ESB or messaging system. It can probably hit those requirements, but they aren’t really where the value lies. The easiest way to explain Kafka is to say that <strong>it offers an integrated data plane for distributed applications and allows them to persist, manage and share state.</strong> If another application wants to inspect that state, Kafka enables this - we can set the retention policies on the data for our use case and then write SQL against the data or interact with it in other ways. If we need bidirectional communication between the two applications, this is also covered, and we can set things up so that communication failures and temporary service unavailabilities on either side are recoverable.</p><p>Kafka offers guarantees around data consistency, durability, and availability and allows us to scale and monitor applications almost infinitely and 'for free’, in terms of the engineering effort required to add such features. <em>NB; As with all things in the enterprise, in practice Kafka is often misconfigured and won’t provide any of these benefits.</em></p><p>In case it isn’t clear, being able to offload the concerns I’ve just mentioned is A Big Deal. Figuring out how to distribute messaging and storage is time consuming from an engineering perspective, adds zero perceived value to the user experience and enables zero functional requirements.</p><h3 id="but-you-have-to-do-it-because-a-404-message-doesnt-hit-anyones-requirements">But you have to do it, because a 404 message doesn’t hit anyone’s requirements.</h3><p>Moreover, the other two parts of the application (human factors and computation) don’t cost the same engineering effort. Hosting front ends in a scalable fashion is basically a solved problem (I mean, the internet works, right?), and if you were looking for advice on how to achieve consistency, durability and availability from the rest of your human factors elements - collectively, wetware, or non-silicon-based-considerations - I suggest an organisational psychology blog might be more your speed, although I’ve personally given up on this. Conversely, computation is arguably so hard that even psychologists aren’t arrogant enough to think they can solve it (or they have yet to hear about the problem, it’s hard to say), so there’s no point in making it a part of the platform.</p></div><h2 id="how-else-could-we-hit-the-same-requirements">How else could we hit the same requirements?</h2><p>It is instructive to consider some of the ways that the architecture we’re presenting here differs from our other options. For example, distributed applications, (especially service mesh designs) will often use RESTful APIs to communicate between components. 
The issue is, that if a RESTful transaction fails, it isn’t clear how to proceed to avoid data loss. We might make it the sender’s responsibility and retry, but then we’ll need to consider implementing circuit breaker patterns - this becomes complicated quickly. If we use Kafka as a messaging solution, we make it the receiver’s responsibility and simply set a retention policy that will cover our maximum expected outage time.</p><p><strong>REST specialises in synchronous unicast communication patterns, Kafka enables asynchronous multicast communication patterns.</strong></p><p>It is nice to have all of our data management in a single place, rather than having different systems to manage transmission and storage. This allows us to centralise monitoring and configuration, from permissions, metrics on reads and writes, latency and throughput, durability via replication factors, distribution via the number of partitions of the data and retention via cleanup policies and retention times. The alternative is often to configure these individually per-application.</p><p>Throw all your components in docker containers, deploy via Kubernetes and you’ve probably delivered something they’ll call a<strong> Kappa Architecture deployed as microservices</strong>, but I’m also happy to call it a service mesh on a persistent substrate; or otherwise, as directed by marketing.</p><p>The genericity of the solution is great because, to me, storage and messaging are the two most boring parts of an application. I’d much rather just implement a single messaging and storage substrate and focus on the interesting parts like the human factors (how do people use it) and the computation (what does it think it does). Naturally, this led me to develop skills in Kafka and, due to the exigencies of capitalism, I now spend quite a lot of time working on storage and APIs.</p>]]></content:encoded>
</item>
<item>
<title><![CDATA[DS II - Making a Mockery (of a Platform).]]></title>
<link>/posts/2019-08-27/Data-Engineering-Pt-2</link>
<guid>/posts/2019-08-27/Data-Engineering-Pt-2</guid>
<content:encoded><![CDATA[<p>In the <a href="../../2019-08-19/Data-Engineering-Pt-1">last post</a> we discussed a simple requirement to store tweets and make them available to other applications. It quickly became clear that it wasn’t as simple as it looked - in fact, it was so complicated that it actually called for an entire platform to be built, replete with a full data plane and separate control plane.</p><p>Mocking up the platform side of things is the subject of this post, in which we’ll discuss getting a dev environment set up. Once you land in an organisation, the first thing you often want to do is replicate their production environment as closely as possible, ideally locally on your own machine. This allows you to test your application in an environment that mirrors the one it will ultimately run in.</p><h3 id="but-my-code-is-already-unit-tested">“But my code is already unit tested!”</h3><figure class="document_figure__gp_zq"><img src="/static/media/another_hacker.1c323ee1.gif" class="document_image__P0Wy7" alt="Cool guy coding."/><figcaption>No number of unit tests will make you this guy.</figcaption></figure><p>This is a common objection to this approach when you present it to traditional software engineers. But if those tests are like most I’ve seen, I suspect that they might say something very deep and subtle like <code>assert MyClass==MyClass</code>. This is roughly equivalent to the conversation at every bad first date you’ve ever been on. <em>Your tests will agree with a set of statements which only a psychopathic piece of code would fail to agree with, and which tell you nothing about the inner workings of the subject under examination; much less its performance in your particular environment, how it will get along with the rest of the apps if invited to a party, whether it will try to monopolise your limited resources, or knows how to clean up after itself.</em></p><p>Sorry, we were talking about software weren’t we…</p><p>Even if you’ve gone to the trouble of finding a proper test framework and using some sample data (because this is about data science, remember?) <strong>the scale or velocity of that data may be completely different in the real world - this is often the problem you are trying to solve.</strong> So unit testing doesn’t really cut the mustard, <em>and actually the local testing approach we demonstrate here isn’t very thorough either, but it doesn’t require you to go and buy resources on GCP or AWS, so that’s a plus, and you can always do that later.</em></p><div class="document_sidebarcontainer__PXI6m"><span class="document_sidebar__uvsTe"><h3 id="whats-in-the-box-about-the-images-were-using">What’s in the box? About the images we’re using.</h3><p>This is a mock platform, so it is going to be missing things. While you’re welcome to talk to me on Twitter about all the things it is missing, I expect that there will be a vast number. What we are including however, is;</p><ul><li>gcr.io/etcd-development/etcd:v3.3.13</li><li>landoop/fast-data-dev</li><li>fluent/fluentd:v1.3-debian-1</li></ul><p>Oh? You went and ran <code>Docker pull</code> didn’t you? Sorry,<strong> Minikube runs in a VM with a wholly separate filesystem.</strong> You may want to delete those if you’re short of disk space. Minikube creates and deletes VMs all the time, so add the above images to a cache using <code>minikube cache add ${IMAGE}</code> to avoid re-downloading frequently.</p><p>The etcd and fluentd images are provided by the organisations developing both tools. 
The <a href="https://github.com/Landoop/fast-data-dev">Landoop Kafka image</a> is one that I see out in the industry semi-regularly, it contains a full Kafka setup which includes Zookeeper and a Schema Registry, alongside some GUI tools to inspect your cluster and make sure it is giving the expected results. Feel free to work something up from the official Confluent images or similar if that’s more your speed, but be prepared to spend a great deal of time kludging around with unwieldly arguments with Kafka’s CLI tools <em>(I’m generally a CLI fan, but only where they have been designed for humans, or indeed designed at all)</em>.</p></span><h2 id="pre-requisites">Pre-requisites</h2><p>I will assume you have Docker installed and are reasonably proficient at using it. There are plenty of tutorials to help with that, so I’ll suggest a Google search if you hit issues. You will also need to install <a href="https://kubernetes.io/docs/setup/learning-environment/minikube/">Minikube</a> and <a href="https://kubernetes.io/docs/tasks/tools/install-kubectl/">Kubectl</a>, which we’ll use to mock Kubernetes (but only gently, so as not to hurt its feelings). <strong>I strongly advise enabling autocompletion</strong> for Kubectl (and most CLI’s generally), and future you will be greatful to be one of the few engineers without RSI if you do so.</p><p>I’m running everything on Linux, and you can look up instructions for Windows. <em>(Personally, my advice is “perhaps consider a different career path, I hear accounting is nice and has something to do with numbers, so it must be just like AI - right?” Ironically, I’m relying on readers not taking this advice to keep my potential readership relatively high - because Windows is still quite dominant.)</em></p><p>You’ll also need to ensure you have git installed - presumably you’re familiar with it as well, otherwise its back to Coursera with you. Or you can enjoy the <a href="https://git-man-page-generator.lokaltog.net/">documentation</a> (that link points to a parody, but I can’t tell the difference).</p><h2 id="getting-started---minikube">Getting started - Minikube</h2><p>If you’ve recovered from reading the parody Git manpages, <strong>you can clone the <a href="https://github.com/Miles-Garnsey/DataEngineeringPt2">repo</a> if you’re following along.</strong></p><p>Bring minikube up using <code>minikube start --memory=8192 --cpus=2 --extra-config=apiserver.enable-admission-plugins=PodSecurityPolicy</code>, and wait for… a long time… (Note that there is advice floating about online suggesting you can use something like <code>--extra-config=apiserver.GenericServerRunOptions.AdmissionControl=....</code> - this was deprecated, and leads to fierce and incomprehensible errors, as does the attempted use of things like PersistentVolumeLabel. Meanwhile, various admission plugins are enabled by default, so the only one we need to focus on is <code>PodSecurityPolicy</code>.)</p><span style="color:red"> (**EDIT: September 19:** starting this week I've had some trouble with this approach. Haven't had time to research properly; initially I suspected that minikube may have updated something to require `--extra-config=apiserver.authorization-mode=RBAC`. But having done a bit more reading (per this [link](https://github.com/kubernetes/minikube/issues/3818)), I think I neglected to configure some security policies authorizing the core services - this naturally prevents the cluster from coming up. 
I've updated the github repo for this post to include the relevant policies in `psp.yaml`, which needs to be copied to `~/.minikube/files/etc/kubernetes/addons/psp.yaml`, and I've updated `k8s-configure.sh` to include this step. I may write an additional post exploring psp.yaml and providing more detail on how it works, it gave me some insight to fix up my podsecuritypolicy settings which I've improved in `devSecurity.yaml`, as I don't believe these were being applied as intended. </span><p>Once it is up, you should have a mock K8s cluster running in a VM, which you can describe using your kubectl commands such as <code>kubectl describe-cluster</code>. If that all works, we’re good to go.</p></div><h2 id="kubernetes-concepts-and-setup">Kubernetes Concepts and setup</h2><p>Kubernetes has a few entities you’ll need to familiar with, notably - nodes, pods, services, controllers and containers. The documentation on the main website is somewhat unsatisfactory when it comes to describing the design succinctly, and I advise you to consult the <a href="https://github.com/kubernetes/community/blob/master/contributors/design-proposals/architecture/architecture.md#the-kubernetes-node">architecture doc</a> for detail. <strong>Summarising, containers are just Docker containers (at least for our purposes), they run in pods on nodes, which are hosts, and are exposed by services which do network related stuff like load balancing, DNS (including discovery etc.), controllers do deployment and scaling. The pod is kind of the interesting part of Kubernetes that differentiates it from Docker compose in many ways.</strong></p><span class="document_sidebar__uvsTe"><h2 id="running-an-application-on-kubernetes">Running an application on Kubernetes</h2><p>Kubernetes is so flexible that the number of configurables can become overwhelming, so it helps to have a checklist of considerations when you’re crafting a deployment. <strong>The basics you want to include are;</strong></p><ol><li>A namespace</li><li>Memory and CPU limits</li><li>A container image</li><li>A service to handle load balancing</li><li>The number of replicas you’ll require</li><li>(Optional, stateful applications only) Any volumes you’ll need.</li></ol><p><em>N.B. - it should be noted that management love checklists for the same reason you like this one, they’re just overwhelmed by more things; if you show your boss this checklist (and pass it off as your own) you will be able to run a meeting about Standardising Processes which will assure you of a tidy bonus while simultaneously earning you the undying contempt of your colleagues.</em> <strong>Good luck!</strong></p></span><h3 id="kubernetes-pods">Kubernetes pods</h3><p>Applications are deployed in pods, and each instance of the application container should be in its own pod. Pods do not handle replication. Instead, <strong>think of them as a way of packaging the application for runtime/production</strong>. A production environment might mandate that all applications use a particular set of subsidiary apps/containers. Many of these might operate separately from the application itself, and this is called the <strong>Sidecar Pattern</strong>. 
Some of the common things included are proxies that inspect network traffic before forwarding it to the container it was intended for, but there are a variety of use cases around application monitoring, networking, and other stuff that ops people care about and we aren’t going to talk about.</p><p><strong>Side note:</strong> pods are the thing that enable service mesh frameworks like <a href="https://istio.io/">Istio</a> to fly. Istio (for example) installs a hook into K8s which modifies its default pod deployment behaviour and adds additional stuff. I may write a post about this later - one interesting implication is that we could consider taking REST traffic from our existing apps and proxying it to Kafka traffic to ameliorate the deficiencies REST suffers in persistence. There are also technologies such as Cilium, which bundles in additional security and might be worth evaluating for its Kafka interop. I only mention to sketch some of the flexibility available via Kubernetes.</p><h3 id="controllers-services-and-everything-else">Controllers, services… and everything else</h3><p>Examples of controllers include <strong><a href="https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/">replica sets</a> (which is what we’re using here), stateful sets, daemon sets, deployments</strong> and so on. These are just all different ways of getting some set of pods out onto some set of nodes in a cluster and keeping some number there under different conditions (the desired number being something that might change according to various scaling behaviours that we’ll discuss much later). The documentation is sufficent to explain what these do, so I will not cover them in depth beyond mentioning their existence.</p><figure class="document_figure__gp_zq"><img src="/static/media/hacker.d472712f.gif" class="document_image__P0Wy7" alt="A very serious hacker."/><figcaption>Crash override fails on ClusterIP.</figcaption></figure><p>As mentioned above; services do networking things, and in this series we are going to talk purely about <code>ClusterIP</code> services, which do not expose external IPs. If you have the privelege of having customers who might be interested in your app, you would be looking at load balancing, creating static public IP addresses, and other things that would require care and thought. You should not do any of these things unless you understand the security implications. <strong>One more time. Do not do any of those things unless you understand the security implications.</strong></p><p>If you want to add to the K8s API, you’ll be pleased to hear that its entities are customisable. Customised resources that don’t fit within the node/service/pod/container/controller paradigm are quite possible. This is an approach pursued by tools such as <a href="https://www.kubeflow.org/">Kubeflow</a> which adds various machine learning tooling to K8s. This is quite useful, and worth keeping in mind if you plan to run some such service at scale.</p><h3 id="defining-kubernetes-entities">Defining Kubernetes entities</h3><div><pre><code class="language-yaml" data-language="yaml" data-highlighted-line-numbers=""><span class="token key atrule">apiVersion</span><span class="token punctuation">:</span> v1
<span class="token key atrule">kind</span><span class="token punctuation">:</span> Service
<span class="token key atrule">metadata</span><span class="token punctuation">:</span>
<span class="token key atrule">name</span><span class="token punctuation">:</span> kafka<span class="token punctuation">-</span>service
<span class="token key atrule">namespace</span><span class="token punctuation">:</span> dev
<span class="token key atrule">labels</span><span class="token punctuation">:</span>
<span class="token key atrule">app</span><span class="token punctuation">:</span> kafka
<span class="token key atrule">phase</span><span class="token punctuation">:</span> dev
<span class="token key atrule">spec</span><span class="token punctuation">:</span>
<span class="token key atrule">selector</span><span class="token punctuation">:</span>
<span class="token key atrule">app</span><span class="token punctuation">:</span> kafka
<span class="token key atrule">ports</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> <span class="token key atrule">port</span><span class="token punctuation">:</span> <span class="token number">9092</span>
<span class="token key atrule">name</span><span class="token punctuation">:</span> kafka
<span class="token punctuation">-</span> <span class="token key atrule">port</span><span class="token punctuation">:</span> <span class="token number">8081</span>
<span class="token key atrule">name</span><span class="token punctuation">:</span> schema<span class="token punctuation">-</span>registry
<span class="token punctuation">-</span> <span class="token key atrule">port</span><span class="token punctuation">:</span> <span class="token number">3030</span>
<span class="token key atrule">name</span><span class="token punctuation">:</span> kafka<span class="token punctuation">-</span>gui
<span class="token punctuation">-</span> <span class="token key atrule">port</span><span class="token punctuation">:</span> <span class="token number">2181</span>
<span class="token key atrule">name</span><span class="token punctuation">:</span> zookeeper
<span class="token key atrule">clusterIP</span><span class="token punctuation">:</span> None
<span class="token punctuation">---</span>
<span class="token key atrule">apiVersion</span><span class="token punctuation">:</span> apps/v1
<span class="token key atrule">kind</span><span class="token punctuation">:</span> ReplicaSet
<span class="token key atrule">metadata</span><span class="token punctuation">:</span>
<span class="token key atrule">namespace</span><span class="token punctuation">:</span> dev
<span class="token key atrule">name</span><span class="token punctuation">:</span> kafka
<span class="token key atrule">labels</span><span class="token punctuation">:</span>
<span class="token key atrule">app</span><span class="token punctuation">:</span> kafka
<span class="token key atrule">phase</span><span class="token punctuation">:</span> dev
<span class="token key atrule">spec</span><span class="token punctuation">:</span>
<span class="token key atrule">replicas</span><span class="token punctuation">:</span> <span class="token number">1</span>
<span class="token key atrule">selector</span><span class="token punctuation">:</span>
<span class="token key atrule">matchLabels</span><span class="token punctuation">:</span>
<span class="token key atrule">app</span><span class="token punctuation">:</span> kafka
<span class="token key atrule">template</span><span class="token punctuation">:</span>
<span class="token key atrule">metadata</span><span class="token punctuation">:</span>
<span class="token key atrule">labels</span><span class="token punctuation">:</span>
<span class="token key atrule">app</span><span class="token punctuation">:</span> kafka
<span class="token key atrule">spec</span><span class="token punctuation">:</span>
<span class="token key atrule">containers</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> kafka
<span class="token key atrule">image</span><span class="token punctuation">:</span> landoop/fast<span class="token punctuation">-</span>data<span class="token punctuation">-</span>dev
<span class="token key atrule">env</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> CONNECT_HEAP
<span class="token key atrule">value</span><span class="token punctuation">:</span> <span class="token string">"1G"</span>
<span class="token punctuation">-</span> <span class="token key atrule">name</span><span class="token punctuation">:</span> ADV_HOST
<span class="token key atrule">value</span><span class="token punctuation">:</span> <span class="token string">"kafka-service.dev"</span>
<span class="token key atrule">resources</span><span class="token punctuation">:</span>
<span class="token key atrule">limits</span><span class="token punctuation">:</span>
<span class="token key atrule">memory</span><span class="token punctuation">:</span> <span class="token string">"3000Mi"</span>
<span class="token key atrule">cpu</span><span class="token punctuation">:</span> <span class="token string">"1"</span>
</code></pre></div><p>As you can see to your right (or above for those poor unfortunates trying to read this long thin piece of text on mobile), Kubernetes stuff is defined in YAML (which stands - in the great recursive acronym tradition pioneered by the likes of GNU - for YAML Ain’t Markup).</p><p>Note that we’ve created -</p><ol><li>A <code>kafka-service</code> to handle ingress and egress to the Kafka cluster - it is a <code>ClusterIP</code> service, which means no access from outside the cluster, and we’ve gone with an <a href="https://kubernetes.io/docs/concepts/services-networking/service/#with-selectors">approach</a> which uses selectors to route traffic to the pod via K8s DNS.</li><li>A replica set with a single replica.</li></ol><p>YAML is an irritating language to define things in; it is highly sensitive to spaces, has an awkward way of defining lists or arrays (basically a <code>-</code> followed by a space). Tabs are a no go in YAML and will break everything in hard to detect ways. If that’s unclear and generally unpleasant to wrap your head around… then good, I’m pleased you’re getting a feel for the format.</p><p>While YAML looks fine visually, and the intention is quite clear, it is painful to type and prone to errors. If something does go wrong, look for the use of tabs where spaces were intended, or the absence of spaces between <code>-</code>s or <code>:</code>s. The only element that bears commenting on is the <code>labels</code> elements - these are just a way to subset your various K8s entities for subsequent selection and manipulation. You can do things like applying services to pods based on labels, which is what is happening above.</p><p>You’ll also <strong>note the almost complete inability to abstract anything away here</strong> - if you want some common feature (e.g. a set of labels or something) across several services, you’re going to need to either use an additional tool (<a href="https://github.com/helm/helm">helm</a> might help), write it out explicitly in each YAML file, or do something with Kubectl to add it to your request (you can do this for <a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/#setting-the-namespace-preference">namespaces</a>).</p><p>Pay attention to which particular K8s API you’re referring to in the <code>apiVersion</code>, as this matters, and dictates which part of the K8s REST API your commands are directed at.</p><p>I was originally going to discuss it in the next post, but decided that presenting broken k8s configs as correct is fairly cruel to anyone just trying to find a template for Kafka. Many systems (Kafka in this case) require an environment variable added the Kubernetes manifest along the lines of <code>ADV_HOST=kafka-service.dev</code>. This lets Kafka know which address it is listening on - Confluent explain it better than I can <a href="https://www.confluent.io/blog/kafka-listeners-explained">here</a>. If you hit connectivity issues with a system running in Docker containers, checking that the app in the container knows which address to advertise itself at is a good first step in resolving them.</p><h2 id="kubernetes-cluster-security">Kubernetes cluster security</h2><p>We should first take steps to make sure our new cluster is secure. Sadly the only secure computer is one that is turned off, at the bottom of the ocean. 
This being incompatible with, well… basically every other requirement we have - we will have to make do with the clusterSecurity.yaml file in the repo, which ensures that;</p><ul><li>No priveleged mode containers run.</li><li>No containers can run as root.</li><li>Containers can only access NFS volumes and persistent volume claims (for stateful set deployments). In other words, they shouldn’t be touching your local storage.</li></ul><div><pre><code class="language-yaml" data-language="yaml" data-highlighted-line-numbers=""><span class="token key atrule">apiVersion</span><span class="token punctuation">:</span> extensions/v1beta1
<span class="token key atrule">kind</span><span class="token punctuation">:</span> PodSecurityPolicy
<span class="token key atrule">metadata</span><span class="token punctuation">:</span>
<span class="token key atrule">name</span><span class="token punctuation">:</span> minikubesecurity
<span class="token key atrule">spec</span><span class="token punctuation">:</span>
<span class="token key atrule">privileged</span><span class="token punctuation">:</span> <span class="token boolean important">false</span>
<span class="token key atrule">runAsUser</span><span class="token punctuation">:</span>
<span class="token key atrule">rule</span><span class="token punctuation">:</span> MustRunAsNonRoot
<span class="token key atrule">seLinux</span><span class="token punctuation">:</span>
<span class="token key atrule">rule</span><span class="token punctuation">:</span> RunAsAny
<span class="token key atrule">fsGroup</span><span class="token punctuation">:</span>
<span class="token key atrule">rule</span><span class="token punctuation">:</span> RunAsAny
<span class="token key atrule">supplementalGroups</span><span class="token punctuation">:</span>
<span class="token key atrule">rule</span><span class="token punctuation">:</span> RunAsAny
<span class="token key atrule">volumes</span><span class="token punctuation">:</span>
<span class="token punctuation">-</span> <span class="token string">'nfs'</span>
<span class="token punctuation">-</span> <span class="token string">'persistentVolumeClaim'</span>
</code></pre></div><p>This is important because one of the major issues people hit when first using Docker in anger is around the use of the root user in the containers. This is bad practice and often a material security risk in Docker (one of the few), so we want to ensure that if we’re doing it inadvertantly in dev, we get an error immediately rather than encountering mysterious issues at deployment (where we hope that the cluster admin has disabled the capability, as we have!)</p><p>Apply your security policies using <code>kubectl create -f clusterSecurity.yaml</code>.</p><p>The final point to be made on securing this cluster is to take note that <strong>none of the services expose public IP addresses</strong>. To access (for example) the landoop UIs, we’d run something like <code>kubectl --namespace=dev port-forward service/kafka-service :3030</code>, which forwards the port from our local machine over SSH. This is desirable, because we have not configured security on these GUIs, and by keeping them internal to the cluster we can piggyback off kubectl’s authentication mechanisms and simplify our setup. Clearly, not appropriate for production usage.</p><h2 id="kubernetes-services">Kubernetes Services</h2><p>We now have a functioning cluster, so let’s get visibility of it using the Kubernetes Dashboard (we’ll do better than this for monitoring, but we’re still bootstrapping and need some visibility while we do so) using <code>minikube dashboard</code>. On a real cluster we’d deploy pods and a service to handle this, but minikube makes things a bit simpler and lets us avoid creation of service accounts and worrying about authentication and so on.</p><p>Without any further stuffing about, let’s get it hooked up.</p><ol><li>Create a dev user and namespace with <code>kubectl apply -f dev-cluster.yaml</code></li><li>Run <code>kubectl apply -f ${FILE}</code> for each of the components we need (etcd, Fluentd, Kafka) for our dev platform and watch them appear on the dashboard.</li></ol><p>Done. There are many improvements that could be made to this setup, but for writing an application it works as a minimum viable platform. If we were deploying this in the real world, we would have a vastly expanded range of things to think about;</p><ol><li>How does our logging actually work? Fluentd needs somewhere to land its data, and Kafka isn’t a good choice (for reasons that we’ll cover later in the series). We could use Elasticsearch and Kibana (or something similar like Prometheus and Grafana) to visualise our logs from Fluentd.</li><li>Authentication for all these front ends we’re creating, and then Authorization to determine what permissions people should have.</li><li>Where are we going for level two support? If we can’t fix something ourselves who do we bring in and how much money do we have to pay them?</li><li>What network connectivity do we need? Load balancing? CDNs?</li><li>While we’re talking about load, what about autoscaling?</li></ol><p>6 - 100. Everything else.</p>]]></content:encoded>
</item>
<item>
<title><![CDATA[Six Ways Your Kafka Design will Fail.]]></title>
<link>/posts/2019-09-06/About-Kafka</link>
<guid>/posts/2019-09-06/About-Kafka</guid>
<content:encoded><![CDATA[<p>In our blog <a href="../../2019-08-19/Data-Engineering-Pt-1">series</a> on data engineering and data science, we decided to use Kafka as a distributed data plane for the platform. For a fairly simple system, there is quite a lot to say about Kafka, and plenty of ways for you to embarrass yourself while implementing whatever business-value-adding-doovalacky is important this week. In this post we’ll cover some of the most common, dangerous, or strangest.</p><p><em>NB: I promise this will be the last conceptual post for a while and the next several will be pure implementation with substantially more code. While I personally think this post is a bit dry, I was explaining so many Kafka concepts in the following few that I decided to consolidate them all in a single post. I have tried to add some levity by swearing a lot, hopefully this helps.</em></p><h2 id="kafka-basics">Kafka Basics</h2><p>Kafka is a key-value setup, where keys and values are understood only as arbitrary bytes. There are a few concepts to understand in order to use it effectively, including brokers, topics, partitions, replications and consumer groups. Kafka itself does relatively little, it doesn’t understand any data type other than raw bytes, makes no decisions about how to assign data across a cluster, and doesn’t decide how to feed that data back to consumers. <strong>Typically, you’ll be told that keys in Kafka are for either</strong> (a) partitioning/distributing the data on the cluster; or (b) something to do with indexing, like a database table index. Both statements are bullshit, and we’ll explore why below.</p><figure class="document_figure__1_Xbb"><img src="/static/media/facepalm.d3a73841.gif" class="document_image__2FL70" alt="facepalm."/><figcaption>I fucked up, so you don't have to.</figcaption></figure><p>One of the reasons there might be so many misconceptions about the system is that there are a variety of other tools in the ecosystem which are often referred to in the same breath as Kafka (e.g. the Kafka streams library for Scala and Java, or the confluent Avro serialisers which interface with the schema registry - another component commonly deployed with kafka). All of these tools use Kafka’s primitives (partitions, topics, consumer groups) in different ways.</p><p>The schema registry in particular deserves a quick explanation, as it stores <strong>Avro schemata which are often used to serialise messages on Kafka.</strong> The schemata are required for both serialisation and deserialisation, and the schema registry is an external system which can provide these. The Confluent avro serialiser makes use of it to automatically retrieve and write schema in Java and Scala. Similar Kafka clients exist for Python also.</p><div class="document_sidebarcontainer__1fEfN"><h3 id="consumers-and-consumer-groups">Consumers and consumer groups</h3><p>Consumers (of which there may be many for a given topic) subscribe to a partition (or several) within a topic to receive data as it arrives at the brokers. Once data is received, the consumer “commits” an offset to track where it is up to in the stream. In case of failure, this can be used to restart processing.</p><span class="document_sidebar__QQejK"><h3 id="offsets-are-committed">Offsets are committed!</h3><p>Great, but <strong>where</strong> are they committed sorry? Another fact that is often missed about Kafka is that consumers actually make that decision. 
While traditionally Zookeeper was often used for this purpose, we might equally decide to use some other database, (like etcd). I believe that as of Kafka 0.9, there is a topic in Kafka which handles this, but that isn’t always optimal.</p><p>For a stateless consumer, there’s no reason to track offsets at all. For high performance consumers, or where there are a large number of consumers (e.g. Kafka is backing data delivered via a RESTful web API to a browser app) you might want to make the consumers responsible for tracking their own offsets locally.</p><p><strong>Commit behaviour is often overlooked, but can be a powerful tool.</strong> For example, a microservice might do some processing to each record and send it to an external REST API. It could be configured to commit an offset only once a 200 <code>OK</code> response had been received. In the event of failure in the external API, no offsets are commited and the data remains “unread” as far as Kafka is concerned. Even in the event of a consumer crash concurrently, we can ensure reliable delivery to the third party and data will not be lost.</p></span><p>Consumers form consumer groups, and a given consumer group receives the same data only once. Not every consumer will receive every message on a topic, but it will receive every message on its partition. <strong>You can see how this naturally leads to thinking of a service or application as a consumer group, and the consumers as instances of the application.</strong></p><h3 id="partitions">Partitions</h3><p>Partitions are the primary way Kafka parallelises computations. A partition is just a segment of data which has been allocated to a broker. The broker handles all consumer and producer related activity for that particular segment/shard/partition of data.</p><p>Because consumers, consumer groups, the application logic relying on the consumer, and the number of instances of the application are all tightly linked, <strong>it is highly desirable to avoid changing the number of partitions in production.</strong> All partitions are always allocated out to however many consumers you run, so it makes sense to overprovision partitions for future data growth. You can’t run more application instances than you have partitions.</p><p><strong>For that reason, I usually suggest using a minimum of 9 partitions for large clusters, or 2–3 times the number of brokers for small clusters.</strong> A cluster is rarely smaller than 3 brokers, hence the lower limit of 9. Two partitions per broker doesn’t do much harm, but it does cater for the possibility of growth in the volumes of data (and if your data volume isn’t growing, best start looking for a new job).</p></div><p>One further benefit of overprovisioning your partitions is that it helps with cluster rebalancing. Anyone who’s ever looked at discrete optimisation for any packing problem (or just taken an overseas flight on a budget airline) can attest that the most efficient optimisation is simply to make the components to be packed as small as possible. 
If we have 9 small partitions instead of 3 big ones, it becomes much easier to fit them onto brokers that might otherwise not have capacity for them.</p><p>Historically <a href="https://www.confluent.io/blog/how-choose-number-topics-partitions-kafka-cluster">fewer partitions were desirable for performance reasons</a>, but more recent versions of Kafka suffer <a href="https://www.confluent.io/blog/apache-kafka-supports-200k-partitions-per-cluster">few performance drawbacks</a> from increasing the number.</p><p>Finally, it bears mentioning that different Kafka designs can make use of different partitioning arrangements. One common misconception (explored below) is that partitioning always happens to same way. Actually, <strong>it is the consumers and producers which decide how partitioning occurs.</strong> Kafka can map from partitions to brokers (and therefore retrieve the data in a particular partition) but it does not keep track of which data is in which partition.</p><h3 id="topics">Topics</h3><p>Topics are the most visible entity in Kafka, so of course, everyone thinks they understand them. They are most usually wrong. Many of the misconceptions below are about topics, and those who repeat them are usually highly confident in their pronouncements.</p><figure class="document_figure__1_Xbb"><img src="/static/media/order.e91267ca.gif" class="document_image__2FL70" alt="Order!"/><figcaption>Keep it orderly.</figcaption></figure><p>While there are analogies to be made between topics and database tables, and while we do often see that a topic will be linked to an Avro schema (and indeed, it might be only one schema, not several) none of these facts actually captures what a topic is for. In a nutshell, topics (at least in a well designed Kafka system) actually encapsulate an ordering of their messages.</p><p>Kafka is a streaming system, so delivery of messages in-order is important. Kafka guarantees that a given consumer will receive all messages on a given partition in the order they were written. There are two catches here -</p><ol><li>Messages between different topics enjoy no such guarantee.</li><li>Message between partitions receive no such guarantee.</li></ol><p>As a result, if you were in a situation where you wanted to preserve the ordering of a particular data stream within your system (for example, you’re implementing an <a href="https://docs.microsoft.com/en-us/previous-versions/msp-n-p/jj591559(v=pandp.10)">event sourcing</a> system), it needs to be mapped (through the partitioning strategy mentioned above) to a single partition. <strong>We give some examples below as to how this plays out later in this post</strong>, as messages being delivered out of order is one consequence of Kafka-fuck-ups numbers five through six, below. They are particularly interesting because <strong>solutions are not very widely available for all language APIs</strong>, and the documentation surrounding them is economical in its insights (if not outright nonsense and probably written by someone in marketing).</p><p><strong>So why do we still talk about topics providing the ordering? 
</strong> Because topics hold both the partitions and the partitioning strategies, so provided the partitioning strategies are rational (and people tend not to think about these very much due to them often being provided by Confluent and the like) we can assume that we get the ordering guarantees we’re looking for.</p><h3 id="replications">Replications?</h3><p>Replications differ from partitions, while partitions split data across the cluster to shard it out amongst the Kafka instances, replications copy the entire topic (with the same partitioning scheme) to other brokers. This is for durability and availability - so that broker failure cannot cause data loss - but one strange implication is that sometimes brokers will have the entire dataset on them, which seems at odds with the idea of using a distributed system. For example, if we have 3 brokers, partitions and replications, every replication has to be on a different broker, which means that each broker will have a replication of each partition. While this seems to defeat the point of using a distributed system, it is actually no different to the way striped RAID used to work, and once we add more brokers (if we decide we need to) we will find each partition of each replication (9 partitions) can be placed anywhere within the cluster.</p><h1 id="top-misconceptions-about-kafka">Top misconceptions about Kafka</h1><p>Ah the part you were waiting for… What are some of the ways all of this can go wrong? Normally things going wrong starts with someone misunderstanding how the system works and what’s implied by that. There are plenty of ways for this to happen even in a relatively simple system like Kafka.</p><h3 id="1-partitioning-happens-according-to-the-key-therefore">1. Partitioning happens according to the key, therefore…</h3><p>Actually partitioning happens according to whatever partitioning strategy our producers are using. You also want to hope that the consumers know about this, or they won’t know how to process the data they receive. I had this debate with some colleagues at one stage, where we discussed whether using differently named Avro schemata for a key would land messages on different partitions (because the hash of the key would be different due to the differing magic numbers in the two schemata).</p><p>From recollection, some further research turned up the fact that the Confluent Avro Serialiser actually does some work to hash the logical (not serialised) value of the Avro message, in order to correctly allocate it to a partition irrespective of the magic number, schema name, or other considerations.</p><h3 id="2-all-of-our-topics-just-have-three-partitions-theres-no-point-running-more-than-three-brokers">2. All of our topics just have three partitions, there’s no point running more than three brokers…</h3><p>The mistake seems obvious when you’re looking for it, but I’ve heard this more than once. Naturally, if you have more than one topic, you can distribute partitions for different topics to brokers across the cluster. Not every broker needs to hold the same set of topics.</p><div class="document_sidebarcontainer__1fEfN"><h3 id="3-a-topic-is-just-like-a-table">3. A topic is just like a table</h3><p>Absolutely not. Topics support multiple data types (see above), they do not support mutable operations (there’s no update, only inserts and subsequent deletes).</p><p>Topics should be thought of as a “slice of state” (to borrow from the React.JS community’s writings on flux libraries). 
That slice can contain a duplicate of data held elsewhere - that’s fine. Normalisation is not important; what is important is the ordering of the events relative to one another within the topic.</p><span class="document_sidebar__QQejK"><h2 id="partitioning-and-in-order-delivery---an-example">Partitioning and in order delivery - an example</h2><p>If you were streaming tweets, and had included a particular type of message indicating that a tweet was deleted, it would be important to ensure that the original tweet and the record indicating that it was deleted resided within the same partition.</p><p>Naturally, this causes a good deal of complexity. Imagine a system with users, tweets and tweet deletion functionality. If each were stored in a database table, and we did change data capture (CDC) to track changes to the underlying rows, a naive approach might;</p><ol><li>Push these into separate Kafka topics with their own defined data structures.</li><li>Key the events according to the database keys.</li></ol><p>Both would be wrong. There should be a single topic and a key that makes sense across all three entities (probably a user ID, given it is the highest level entity). In reality, to retain consistency we would need to ensure that all data related to a user was eventually mapped to a single partition. If users were on a different partition to tweets, it would be impossible to ascertain if user data had changed before or after a tweet, but if tweets and deletions were on separate partitions, it would be impossible for a given recipient of a tweet to determine whether a tweet had been deleted.</p><p>Event sourcing is a set of architectural patterns which can help resolve these issues. Under an ES approach, we’d ensure that any changes from the tweet and deletion tables were joined to some user key field prior to being sent to Kafka. The user is the actor entity in this situation, and only a single user can create a tweet. <strong>Multiple primary entities make things even more complex.</strong> If we were talking about DMs between users, both need to be considered primary, and ES might advocate that we replicate the data under two keys, one belonging to each user. In this case, we’d say that only a single user can create or receive a DM, and would probably create <code>SentMessage</code> and <code>ReceivedMessage</code> event types which would hold mirror images of the same data. The important thing in both cases is that there is a primary actor entity which can be referred to with a single key so that its data can be routed to the correct partition.</p><p><strong>As these examples demonstrate, it is normal to see data replicated and stored in a very denormalised format within event-based systems.</strong></p></span><h3 id="4-oh-but-when-it-is-log-compacted-then-a-topic-is-just-like-a-table-right">4. Oh, but when it is log compacted, THEN a topic is just like a table, right?</h3><p>No, still no. Tables are still tables and topics are still topics. One is for normalising subsets of that data to avoid duplication (if you’re using an RDBMS; if you aren’t, you don’t have tables) and one is for ensuring an ordering of subsets of that data. They are not the same; there has been no prestidigitation performed, no rabbits have been pulled out of hats.</p><p>For all the reasons above, this is again incorrect. And when you update a key in a log compacted topic, you are doing an <code>INSERT</code> followed by a <code>DELETE</code> some unspecified length of time later. 
You are not, under any circumstances, doing an <code>UPDATE</code>.</p><h3 id="5-a-topic-should-contain-a-single-type-of-data">5. A topic should contain a single type of data</h3><p>I’m guilty of this one, and it wasn’t until I started asking why the registry allowed <a href="https://www.confluent.io/blog/put-several-event-types-kafka-topic/">multiple schemata per topic</a> that I figured it out. As you can see from the sidebar, there are situations where you’ll need to have multiple event types in a single partition - mostly when those event types relate to each other in some clearly defined way which is dependent on their ordering. You can’t have a refund before you’ve made a sale; you can’t delete an email before you’ve received it; you can’t end a call or web session before you’ve commenced it.</p><p>The events indicating that these things have happened will often have different schemata and (if you use a type-heavy language like Scala) different types, as the fields required might be quite different. Those required to describe a refund will clearly differ from those required to describe a sale. But there is no way that a refund should ever occur in advance of a sale, so they really should be in the same topic (unless you really like being a victim of fraud, in which case go for gold).</p><p>This is further evidence of how a topic is not-like-a-table, so people who subscribe to the former fallacy are probably also vulnerable to the latter.</p><p><em>Note that if you do have to take this path, it should be a conscious decision, and you should probably turn off automatic schema creation in your Kafka Producer. Where this functionality is not required, the schema registry should be configured to only allow one schema. Failing to do so (especially if working in a team) means that any change to the schema can break compatibility for any producers which haven’t received the change. It is painful.</em></p><h3 id="6-we-can-join-data-between-different-topics">6. We can join data between different topics</h3><p>As we’ve established in depth above, topics are for ordering events. We often need to both;</p><ol><li>Think carefully about partitioning strategy to ensure messages are on the same partition where their ordering matters.</li><li>Allow several varieties of messages on a single topic, even if this means allowing several Avro schemata for that topic.</li></ol><p>This is particularly unfortunate when a single topic has been split into several, and someone then decides that recombining them is required. Kafka Streams claims to do roughly this, but the documentation glosses over all of these nuances in favour of bullshit. The reality is that this is really not possible. For example, if additional processing steps (or simple network latency) delay the messages from one stream, joins may be missed. Kafka Streams can use the timestamps on the messages in various ways, but it cannot use them to reimpose proper ordering on a stream which has been split. <em>And if that exact wording could be added to the doco I think we’d see far fewer poor implementations, bluntly.</em></p></div><p>So there you go, a bunch of ways to stuff up data systems based on Kafka. If you have seen anything else terrible, hilarious, or outright scary when developing systems on Kafka, let me know on <a href="https://twitter.com/MilesGarnsey/">Twitter</a> and we can swap war stories.</p>]]></content:encoded>
</item>
<item>
<title><![CDATA[DS III - Making a Data Plane Fly.]]></title>
<link>/posts/2019-09-10/Data-Engineering-Pt-3</link>
<guid>/posts/2019-09-10/Data-Engineering-Pt-3</guid>
<content:encoded><![CDATA[<p>OK, now we have something resembling a platform that we can develop against, let’s talk about how our application will actually work (finally, you gasp). This is an application which will collect data from Twitter and make it available to others.</p><p><strong>At the top level, we need to think about;</strong></p><ol><li>How long we want to store the data for - is it just until the next service has time to access it, or is it a permanent record?</li><li>Do the individual records have any sort of history (for example, how do we treat edits to Tweets) and do we want to retain all of it?</li><li>Does the structure of the data change? If so, how do we manage changes to the data structure such that consumers expecting a previous structure can consume the new structure?</li><li>Velocity, variety and volume of data.</li></ol><p><strong>Translating through to a technical design -</strong></p><ol><li>We want to store the data forever; this is a data storage system, not a messaging system.</li><li>For the sake of simplicity, we will overwrite previous versions of tweets.</li><li>We do not control the data structure, and should therefore expect it to change.</li><li>We really only have one type of data present, which is tweets, and we can scale for velocity and volume using more instances in our Kafka consumer group (but see the note in italics below, and my forthcoming post on Kafka for some road bumps in this story).</li></ol><p>Translating through to the implementation - (1) and (2) suggest that what we want is <strong>the most recent version of each record emitted stored forever - so we’ll need to set our Kafka topics to use log compaction</strong>. We then just need to ensure that the duplicated data is retained for long enough that all consumers can pick it up before old versions are deleted.</p><p>Requirement (3) can be addressed by using <strong>Avro’s schema evolution (next post)</strong>, while <strong>velocity is addressed by using Kafka</strong> (throughput can be increased by horizontal scaling of brokers and application instances), volume is the same story, and the variety concern is not very relevant here.</p><p><em>NB: I believe that there is something in the developer’s agreement with Twitter to the effect that tweets need to be deleted from your app if a corresponding <a href="http://danielasfregola.github.io/twitter4s/6.1/api/com/danielasfregola/twitter4s/entities/streaming/common/StatusDeletionNotice.html"><code>StatusDeletionNotice</code></a> is received. I am not touching on this, because my assumption is that you are running a temporary environment which is deleting <em>ALL</em> data on every restart. If this is not the case, and you get sued or something, you can’t say I didn’t warn you…</em></p><div class="document_sidebarcontainer__3KzLz"><span class="document_sidebar__363Ye"><h3 id="design-choices">Design Choices</h3><p>Given the way Kafka works, we have a few choices we need to make -</p><ol><li>Number of topics - In this case we know we have a single topic and there are apparently few decisions to make. 
See the previous post <a href="../../2019-09-06/About-Kafka">about Kafka</a> for some of the intricacies that influence this decision.</li><li>Number of partitions - This will set the maximum number of application instances we can ever deploy to scale horizontally, and will also set the maximum number of Kafka brokers which can contribute to data storage.</li><li>Number of replications - Because we only have a single broker on our test platform, we can only have one replica, but we would have more choice in a production cluster and should consider this at development time.</li><li>ISR settings - If a write is in progress and we have one replica on a single broker, do we acknowledge to the producer that the data is written, or wait for the data to replicate to more brokers to ensure that a broker failure can’t cause data loss?</li><li>Number of application instances - This is often easier to scale in production than is the number of partitions.</li></ol><p>As with all things distributed, we should always think of a world in which our cluster is built from one hundred slow machines, rather than ten fast ones. Generally, smaller instances cost less on the cloud, so the better you can distribute your application, the more money you save. It also helps with scaling, should you Enhance Shareholder Value sufficiently to Drive the Step Change and Experience Hockey-Stick Growth. Yay.</p></span><span><h3 id="retention">Retention</h3><p>We’re using log compaction to retain the most recent iteration of each tweet: in contrast to the standard cleanup strategy - delete - log compaction keeps the most recent value for each key. Log compacted topics have a <code>cleanup.policy</code> set to <code>compact</code>, but the conditions under which the compaction actually takes place are a bit confusing. The <code>delete</code> option would use <code>retention.bytes</code> (which controls maximum topic size, and is disabled with a value of -1 by default) as well as <code>retention.ms</code>, which controls the maximum age of a record before deletion occurs (and is not to be confused with <code>delete.retention.ms</code>).</p><p>But when we use a compaction cleanup strategy, it actually uses <code>min.cleanable.dirty.ratio</code> - the proportion of log space consumed by old records, expressed as a value between 0 and 1. Complicating the situation further is that no compaction can take place before the time specified in <code>min.compaction.lag.ms</code> has elapsed, which we can use to our advantage to ensure that consumers have an opportunity to read the data before it is compacted. Note that what is missing is an ability to set a corresponding <code>max.compaction.lag.ms</code> (one only arrived in Kafka 2.3), so don’t think of log compaction as a way to fake a database <code>UPDATE</code> operation (a common error); it isn’t - it’s an <code>INSERT</code>. For more ways you can think incorrectly about Kafka, see the <a href="../About-Kafka">previous post</a>.</p><h3 id="managing-dev-and-production-configs">Managing dev and production configs</h3><p>So we have a pretty good idea of the low level/transport layer for the data now, and we’ve come up with a way to use Kafka to address points one and two from our technical design to get data across the cluster in a reasonable and safe way. 
Refer to twitter-topic.conf for a summary of the appropriate settings; if you’ve been following the above, they should make sense.</p><figure class="document_figureleft__1kQb0"><img src="/static/media/stern.5f3e1298.gif" style="width:auto;max-width:inherit;height:auto" class="document_image__rZXyh" alt="Guy laughing."/><figcaption>Yes, all of our infrastructure is code deployed in CI/CD.</figcaption></figure><p>The only problem is that we only have a single broker on our cluster, so how can we have three replications? Trying to create this topic will result in an error. I’m leaving it that way, because that’s how we’d like to see it in production, where we won’t be able to change that value - because we’ll be using source control or a CI/CD pipeline to push our changes and definitely won’t be creating any topics by hand… right…?</p><p>I may post about building a tool to parse and apply these files later, but overriding these kinds of configs in our dev environment is what etcd is there for (among other things), and we’ll cover that in a future post.</p></span></div><div style="width:100%;clear:both"></div><h2 id="producing-messages-to-kafka-from-scala">Producing messages to Kafka from Scala</h2><p><strong>Grab the Github repo from <a href="https://github.com/Miles-Garnsey/DataEngineeringPt3">here</a>.</strong> If (like me) you’re really a Python engineer and not wholly impressed with the idea of compilation generally, you’ll probably find this folder structure to be a dog’s breakfast. Sorry.</p><h2 id="one-click-deploy---wow">One click deploy - Wow.</h2><p>I’m even more sorry to report that this is a standard Scala project, with a structure common to those using the fairly ubiquitous(ly derided) build tool <code>sbt</code>. Do not blame me; I do not make the rules. While other options (Ammonite especially) exist, we couldn’t demo all of the Avrohugger things we’d like to from there, so we may as well get the pain over with early.</p><p>You can install <code>sbt</code> yourself; Google is your friend if you have difficulty. You’ll also likely want an IDE suited to Scala, your main options being <a href="https://www.jetbrains.com/idea/">IDEA</a>, <a href="https://code.visualstudio.com/">VS Code</a> or <a href="https://atom.io/">Atom</a> with <a href="https://www.scala-lang.org/2019/04/16/metals.html">Metals</a>, or hacking something together using <a href="https://ammonite.io/">Ammonite</a>, an editor, and some way to work with an SBT file (which might just be manually copying stuff given the top hit for “sbt ammonite” on Google is <a href="https://github.com/alexarchambault/sbt-ammonite">broken</a>).</p><p>None of these options is great, but IDEA is quick to get running (although it frequently produces false positive errors and other strange behaviours).</p><div class="document_sidebarcontainer__3KzLz"><span class="document_sidebar__363Ye"><h3 id="wheres-my-code-five-levels-deep">Where’s my code? Five levels deep.</h3><figure class="document_figureright__hCBOD"><img src="/static/media/inception.5580458b.gif" class="document_image__rZXyh" alt="Inception."/><figcaption style="font-size:smaller">Are we still in the project directory?</figcaption></figure><p>Ah… Yes. We all know that one of the best ways to convince your boss that you’ve done some work is to completely bamboozle them, and often some sort of directory structure Inception is just the ticket. 
Scala lore therefore institutes unquestionable rules stating that:</p><ol><li>Only one class, in only one file, shall exist per folder.</li><li>Every class shall exist in its own namespace, inside one of those folders.</li><li>The namespaces defined in <code>package</code> statements and folder structure shall reflect each other perfectly.</li></ol><p>Follow them, and I guarantee that by the time your boss has finished clicking through the endlessly nested hierarchy of filesystem goodies, they’ll have forgotten what they were looking for in the first place, and will not ask you questions. Truly, Scala is an enterprise-fit language.</p><p>I note that I have severely violated these rules in this tute by basically chucking everything in the same folder (<code>main.scala.io.github.milesgarnsey.blog</code>). This has the pleasing upside that I don’t have to import my own code, and suits my purposes in that my reader may actually understand some of what is going on.</p></span><p>To start with, have a look at the build system files - especially;</p><ol><li><code>build.sbt</code></li><li><code>./project/plugins.sbt</code></li><li><code>./project/.sbtopts</code></li></ol><p>(1) defines the libraries you want to use, and (in our case) (2) defines an <code>sbt</code> plugin to build Docker containers. Finally, (3) saves us a lot of head scratching down the line when we start hitting mysterious heap errors - which is something that can happen during both compilation and run phases, thanks to the joys of the JVM. <code>sbt</code> creates plenty of additional cruft (which is both distracting and infuriating), but most of it can be ignored.</p><p>If we’re modern developers, it would be nice if we could get this into our cluster relatively seamlessly, but (from the second post in this series) we know that there are some complexities behind getting docker images into minikube, let alone unpackaged application code.</p><p><strong>What we didn’t mention is that to access the docker daemon in minikube, we can actually run <code>eval $(minikube docker-env)</code>, which will then direct the commands we’re typing into our regular Docker client back into the Minikube VM.</strong></p><pre><code class="language-bash" data-language="bash" data-highlighted-line-numbers=""><span class="token builtin class-name">eval</span> <span class="token variable"><span class="token variable">$(</span>minikube docker-env<span class="token variable">)</span></span>
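# Build the application image against Minikube's Docker daemon (courtesy of the eval above), then recreate the pods so they pick up the new image.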
<span class="token punctuation">(</span>cd <span class="token punctuation">..</span> <span class="token operator">&&</span> sbt docker:publishLocal<span class="token punctuation">)</span>
kubectl --namespace<span class="token operator">=</span>dev delete pods,replicasets.apps -l <span class="token assign-left variable">app</span><span class="token operator">=</span>data-engineering-blog
kubectl apply -f deploy.yaml
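# Follow the logs from every container in the pods matching the app label (including the previous instance, if one crashed).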
kubectl logs -f -p -l <span class="token string">"app=data-engineering-blog"</span> --all-containers<span class="token operator">=</span>true
</code></pre><p>So because we also have the <code>sbt</code> packager set up to use Docker, we can actually just run <code>sbt docker:publishLocal</code> and <code>sbt</code> will publish to Minikube. I’ve included a K8s manifest to ensure that this all works correctly once we hit deploy. More likely, you will just run the script above - <code>deploy.sh</code> - without thinking too much about it. <strong>Congratulations, laziness is becoming in a data scientist, and you’re developing well in this regard.</strong></p><p>We could get a lot fancier and do this with CI/CD servers which hook up to Github (GoCD and Jenkins spring to mind), and private Docker Registries which receive their output images and redeploy the app. There are some nice tools that do all of this stuff in a single place (Gitlab for example), but really, what we have here is pretty effective, and almost zero config is required. When you’re moving jobs frequently (e.g. if you have the joy of being a consultant, contractor, or other variety of gun for hire as I do), simple deployment pipelines that rely mostly on basic Linux tools are great.</p><p>Now let’s have a look at this compiled-language business, shall we? Our <code>sbt</code> build definition told us where the main class can be found - <code>Compile / mainClass := Some("main.scala.io.github.milesgarnsey.blog.Entrypoint")</code> - so let’s start there.</p></div><h2 id="scala-an-entirely-rational-and-very-consistent-language">Scala: an entirely rational and very consistent language</h2><p>First off, we can see that our <code>Entrypoint</code> is an <code>object</code> (not a class, because we need something which is a singleton to launch the <code>App</code>). It extends <code>LazyLogging</code>, which is why we can call methods of <code>logger</code> without declaring it. Go research these terms if you aren’t familiar with them. Formalities dealt with, we’re then loading some configs via <code>val config = ConfigFactory.load()</code>.</p><pre><code class="language-scala" data-language="scala" data-highlighted-line-numbers=""><span class="token keyword">object</span> Entrypoint <span class="token keyword">extends</span> App <span class="token keyword">with</span> LazyLogging <span class="token punctuation">{</span>
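// App turns the object body into our main method; LazyLogging (from scala-logging) is what provides the 'logger' member used below.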
logger<span class="token punctuation">.</span>info<span class="token punctuation">(</span><span class="token string">"Initialising Data engineering Part III demo app..."</span><span class="token punctuation">)</span>
<span class="token keyword">val</span> config <span class="token operator">=</span> ConfigFactory<span class="token punctuation">.</span>load<span class="token punctuation">(</span><span class="token punctuation">)</span>
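// Broker addresses, topic and message text all come from application.conf (Typesafe Config), so they can be changed without a recompile.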
<span class="token keyword">val</span> producer <span class="token operator">=</span> <span class="token keyword">new</span> ToyProducer<span class="token punctuation">(</span>
config<span class="token punctuation">.</span>getString<span class="token punctuation">(</span><span class="token string">"kafka-brokers"</span><span class="token punctuation">)</span><span class="token punctuation">,</span>
config<span class="token punctuation">.</span>getString<span class="token punctuation">(</span><span class="token string">"output-topic"</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
<span class="token keyword">val</span> sendresult <span class="token operator">=</span> producer<span class="token punctuation">.</span>sendmessage<span class="token punctuation">(</span>config<span class="token punctuation">.</span>getString<span class="token punctuation">(</span><span class="token string">"output-message"</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
logger<span class="token punctuation">.</span>info<span class="token punctuation">(</span>sendresult<span class="token punctuation">.</span>toString<span class="token punctuation">)</span>
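// Sleep for an hour so the pod stays up for inspection; once App's body finishes, the JVM exits and Kubernetes would restart the container.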
Thread<span class="token punctuation">.</span>sleep<span class="token punctuation">(</span><span class="token number">3600000</span><span class="token punctuation">)</span>
<span class="token punctuation">}</span>
</code></pre><p>While config wrangling is not exciting ML functionality, one of the things I realised early when working with Scala was that compilation via SBT is agonisingly slow. <strong>To avoid spending half our lives waiting for the build cycle to complete and the container to deploy (which is even slower in cloud development environments, especially for larger containers), it helps to have a config file where you can change settings quickly.</strong></p><p><em>Note that agonisingly slow means 45–120 seconds (if it is taking longer for a project like this, you’re doing something wrong). This is agonising if you’re making a bunch of changes in quick succession, don’t know the language well, and want to test after each change. That would be at least 10–20 recompiles per hour, which, if you multiply it out, comes out to you having spent something like a quarter of every hour waiting for recompilation.</em></p><p>We instantiate a ToyProducer class and pass some of the configs - we could pass the whole config, but that would diminish the modularity of the code.</p><p>Diving into what this ToyProducer class actually does, we’ll see that it instantiates a Java <code>Properties</code> object to receive our configs, which tips us off that this may be a Java class being accessed from Scala. When we have to do this we basically lose a lot of the elegance of Scala, so the “Scala can access the Java ecosystem” argument actually isn’t a great one (unless you enjoy spending your life writing extremely awkward interfaces to loosely supported languages).</p><pre><code class="language-scala" data-language="scala" data-highlighted-line-numbers=""><span class="token keyword">class</span> ToyProducer<span class="token punctuation">(</span>bootstrapServers<span class="token operator">:</span> <span class="token builtin">String</span><span class="token punctuation">,</span> topic<span class="token operator">:</span> <span class="token builtin">String</span><span class="token punctuation">)</span> <span class="token punctuation">{</span>
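// The underlying client is the plain Java KafkaProducer, hence the Java-style Properties bag of string configs.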
<span class="token keyword">val</span> props <span class="token operator">=</span> <span class="token keyword">new</span> Properties
props<span class="token punctuation">.</span>put<span class="token punctuation">(</span><span class="token string">"bootstrap.servers"</span><span class="token punctuation">,</span> bootstrapServers<span class="token punctuation">)</span>
props<span class="token punctuation">.</span>put<span class="token punctuation">(</span><span class="token string">"key.serializer"</span><span class="token punctuation">,</span> <span class="token string">"org.apache.kafka.common.serialization.StringSerializer"</span><span class="token punctuation">)</span>
props<span class="token punctuation">.</span>put<span class="token punctuation">(</span><span class="token string">"value.serializer"</span><span class="token punctuation">,</span> <span class="token string">"org.apache.kafka.common.serialization.StringSerializer"</span><span class="token punctuation">)</span>
<span class="token keyword">val</span> client <span class="token operator">=</span> <span class="token keyword">new</span> KafkaProducer<span class="token punctuation">[</span><span class="token builtin">String</span><span class="token punctuation">,</span><span class="token builtin">String</span><span class="token punctuation">]</span><span class="token punctuation">(</span>props<span class="token punctuation">)</span>
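// send() returns a (Java) Future of RecordMetadata; the get() below blocks until the broker acknowledges the write.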
<span class="token keyword">def</span> sendmessage<span class="token punctuation">(</span>message<span class="token operator">:</span><span class="token builtin">String</span><span class="token punctuation">)</span><span class="token operator">:</span> RecordMetadata<span class="token operator">=</span> <span class="token punctuation">{</span>
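// NB: as written, the next line will not compile - ProducerRecord is a Java class, so Scala can't pass its constructor arguments by name (and K/V aren't parameter names anyway). This is the error discussed below.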
<span class="token keyword">val</span> res <span class="token operator">=</span> client<span class="token punctuation">.</span>send<span class="token punctuation">(</span><span class="token keyword">new</span> ProducerRecord<span class="token punctuation">[</span><span class="token builtin">String</span><span class="token punctuation">,</span><span class="token builtin">String</span><span class="token punctuation">]</span><span class="token punctuation">(</span>topic<span class="token operator">=</span>topic<span class="token punctuation">,</span>K<span class="token operator">=</span><span class="token string">"No key here..."</span><span class="token punctuation">,</span>V<span class="token operator">=</span>message<span class="token punctuation">)</span><span class="token punctuation">)</span>
res<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token punctuation">)</span>
<span class="token punctuation">}</span>
<span class="token punctuation">}</span>
</code></pre><p>Once we’ve gotten over the nauseating getty/settyness of the interface, the clunkiness becomes particularly clear on line 13 - try to run deploy.sh in the deploy folder… Yes, you receive an error helpfully telling you that there is something wrong with your method signature. This is because Java methods (even if called from Scala) can’t take named parameters. Remove the parameter names and try again…</p><h3 id="how-can-you-log-that-the-logging-is-broken-if-the-logging-is-broken-so-you-cant-log-that">How can you log that the logging is broken if the logging is broken so you can’t log that…</h3><p>OK, it now builds and deploys; jump into your Kubernetes dashboard (<code>minikube dashboard</code> in your console will do it - please read the previous posts…), select your dev namespace and go to overview; you should see your replica set running there. But sadly the container has failed… So much for Scala catching most errors at compilation time…</p><p>Why did it fail? Who knows - your logging is broken, so at least you can claim that ‘we are not aware of any issues’.</p><p>The logging framework scans your classpath for logging implementations to bind to, and you have one in both <code>logback-classic</code> and the helpfully named <code>SLF4J-Log4J</code>, which has been imported as a transitive dependency (a dependency contained in a library you are importing). Honestly, where did you even learn to Scala? You must have had a hopeless teacher.</p><p>To fix this, we need to go back to our SBT file and separate the <code>Seq</code> containing our library dependencies into two, one of which explicitly excludes the SLF4J organisation -</p><pre><code class="language-scala" data-language="scala" data-highlighted-line-numbers="">libraryDependencies <span class="token operator">++</span><span class="token operator">=</span> Seq<span class="token punctuation">(</span>
<span class="token string">"org.apache.kafka"</span> <span class="token operator">%</span><span class="token operator">%</span> <span class="token string">"kafka-streams-scala"</span> <span class="token operator">%</span> <span class="token string">"2.2.0"</span><span class="token punctuation">,</span>
<span class="token string">"io.confluent"</span> <span class="token operator">%</span> <span class="token string">"kafka-avro-serializer"</span> <span class="token operator">%</span> <span class="token string">"3.1.2"</span><span class="token punctuation">,</span>
<span class="token string">"org.apache.kafka"</span> <span class="token operator">%</span> <span class="token string">"kafka_2.9.2"</span> <span class="token operator">%</span> <span class="token string">"0.8.2.2"</span><span class="token punctuation">,</span>
<span class="token string">"com.typesafe.scala-logging"</span> <span class="token operator">%</span><span class="token operator">%</span> <span class="token string">"scala-logging"</span> <span class="token operator">%</span> <span class="token string">"3.9.2"</span><span class="token punctuation">,</span>
<span class="token string">"com.typesafe"</span> <span class="token operator">%</span> <span class="token string">"config"</span> <span class="token operator">%</span> <span class="token string">"1.3.4"</span><span class="token punctuation">,</span>
<span class="token string">"ch.qos.logback"</span> <span class="token operator">%</span> <span class="token string">"logback-classic"</span> <span class="token operator">%</span> <span class="token string">"1.0.13"</span><span class="token punctuation">)</span>
</code></pre><p>becomes -</p><pre><code class="language-scala" data-language="scala" data-highlighted-line-numbers="">libraryDependencies <span class="token operator">++</span><span class="token operator">=</span> Seq<span class="token punctuation">(</span>
<span class="token string">"org.apache.kafka"</span> <span class="token operator">%</span><span class="token operator">%</span> <span class="token string">"kafka-streams-scala"</span> <span class="token operator">%</span> <span class="token string">"2.2.0"</span><span class="token punctuation">,</span>
<span class="token string">"io.confluent"</span> <span class="token operator">%</span> <span class="token string">"kafka-avro-serializer"</span> <span class="token operator">%</span> <span class="token string">"3.1.2"</span><span class="token punctuation">,</span>
<span class="token string">"org.apache.kafka"</span> <span class="token operator">%</span> <span class="token string">"kafka_2.9.2"</span> <span class="token operator">%</span> <span class="token string">"0.8.2.2"</span><span class="token punctuation">,</span>
<span class="token string">"com.typesafe.scala-logging"</span> <span class="token operator">%</span><span class="token operator">%</span> <span class="token string">"scala-logging"</span> <span class="token operator">%</span> <span class="token string">"3.9.2"</span><span class="token punctuation">,</span>
<span class="token string">"com.typesafe"</span> <span class="token operator">%</span> <span class="token string">"config"</span> <span class="token operator">%</span> <span class="token string">"1.3.4"</span><span class="token punctuation">)</span><span class="token punctuation">.</span>map<span class="token punctuation">(</span>_<span class="token punctuation">.</span>exclude<span class="token punctuation">(</span><span class="token string">"org.slf4j"</span><span class="token punctuation">,</span> <span class="token string">"slf4j-log4j12"</span><span class="token punctuation">)</span><span class="token punctuation">)</span><span class="token punctuation">,</span>
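// Strip the transitive slf4j-log4j12 binding from the dependencies above, then add logback-classic back in as the sole logging backend.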
libraryDependencies <span class="token operator">++</span><span class="token operator">=</span> Seq<span class="token punctuation">(</span><span class="token string">"ch.qos.logback"</span> <span class="token operator">%</span> <span class="token string">"logback-classic"</span> <span class="token operator">%</span> <span class="token string">"1.0.13"</span><span class="token punctuation">)</span>
</code></pre><p>The meaning should be clear to anyone who understands <code>map</code>ping an array in any language. We apply the same operation (<code>exclude</code>) to every element (<code>_</code> in Scala allows us to anonymously refer to the elements being mapped; sadists, and those of us who prefer clarity at the expense of brevity, might prefer the verbosity of <code>x=>x.exclude(...)</code>). We then add <code>logback-classic</code> back in without the exclusion.</p><h3 id="wheres-kafka-running-again">Where’s Kafka running again?</h3><p>Redeploy again and check the logs…</p><p>Oh no… Network errors from Kafka - that doesn’t look too healthy. If we look closely we can see that it is failing to talk to localhost - but hang on, that doesn’t seem quite right… In production, the applications talking to Kafka shouldn’t be on the same machine as Kafka itself - our dev environment has been set up to replicate this, hasn’t it? So change the application.conf to talk to our Kubernetes Kafka service at <code>kafka-service.dev</code> instead of localhost.</p><pre><code class="language-conf" data-language="conf" data-highlighted-line-numbers="">app-name="data-engineering-blog"
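# Point at the in-cluster Kafka service in the 'dev' namespace, not localhost - the app runs in its own pod.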
kafka-brokers="kafka-service.dev:9092"
schema-registry="kafka-service.dev:8081"
output-topic="coyote-test-avro"
output-message="I'm a message!!"
</code></pre><p>Redeploy, check the logs - all looks pretty healthy… And if you use the <code>kubectl port-forward --namespace=dev service/kafka-service 3030:3030</code> command, then head across to <code>localhost:3030</code> to access the Landoop UI, you’ll see there are some messages in your Kafka topic.</p><figure><img src="/static/media/Post 3 - Application running.dd368ed9.png" class="document_image__rZXyh" alt="Application running in Kubernetes"/><figcaption>It works - I took a screenshot to prove it.</figcaption></figure>]]></content:encoded>
</item>
</channel>
</rss>