Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review #1

Open
wants to merge 27 commits into
base: empty
Choose a base branch
from
Open

Review #1

wants to merge 27 commits into from

Conversation

NissesSenap
Copy link
Owner

No description provided.

* Add minor go code, super broken!
* Add NOTES.md for me
* Update command and args and use env instead
* Fix go code...
* Set SA correct in triggerTemplate
* Set ful path in poddeleter command, tekton changes the WORKSPACE
* Ignore poddeleter in falco so we don't get a loop of delete pods
* Add go.sum
* Clean up NOTES.md
* Add LICENSE
Copy link

@bittrance bittrance left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review: the readme proved tricky to review on my phone.

I think I would have split the go main function into two functions: one to create the client and one two maybe kill the pod (i.e. it would take a client, a blacklist and a falco event and maybe do something). That should make it possible to write a few unit test for the latter.

Makefile Show resolved Hide resolved
main.go Show resolved Hide resolved
main.go Outdated Show resolved Hide resolved
critical = true
break
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe my calendar says it is 2021. Surely there is a set or hashmap we can use somewhere?

Copy link

@bittrance bittrance May 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than use an array and iterate over it, we can use a map. Consider:

CriticalNamespaces := map[string]bool {
    "kube-system": true,
    "kube-public": true,
    "kube-node-lease": true,
    "falco": true,
}
// ...
if CriticalNamespaces[ns] {
    log.Printf("Protected namespace %s", ns)
    return
}

Golang has no set type so we will have to abuse a map, using bools as values. Once we have a map, we can ask for stuff in it and we do not need to iterate at all. More about maps here https://blog.golang.org/maps.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh this was what I couldn't really wrap my head around. What I should use for key/value but a simple bool seems like a easy enough idea.

I guess you could argue that the []string is easier for people to understand + in the future when we support having criticalNamepsaces as a input variable that probably would be a []string we would need to do some formatting from the input in to the hash map.

Once again way to long in the future for this issue but still.

But I will update it so I remember how to do it in the future :)

main.go Outdated Show resolved Hide resolved
log.Fatalf("Unable to delete pod due to err %v", err)
os.Exit(1)
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that you try to create the client first, so you get failure even if we don't need to act. Of course, that assumes there are no kube-system (i.e. cluster-level) events that shows up here when the API is temporarily unavailable (i.e. events that should be ignored but we don't get to the ignoring part because we fail).

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have done a break out now, please have a look.

tekton/task.yaml Outdated Show resolved Hide resolved
tekton/task.yaml Outdated Show resolved Hide resolved
tekton/task.yaml Outdated Show resolved Hide resolved
tekton/trigger-template.yaml Outdated Show resolved Hide resolved
Copy link

@simongottschlag simongottschlag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really cool!! 😁😁

config:
webhook:
address: http://el-falco-listener.falcoresponse.svc.cluster.local:8080
customHeaders: Falcon:true\,Stuff:yes

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be Falcon or Falco?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been testing with a few things earlier, I should probably delete it. I would like to get a CEL filter working but tekton have changed the API and got rather tiered when it didn't work.
For now I will keep it just to show that you can add a customHeader. The reason why I wrote falcon was that I wanted to be 100% sure I didn't overwrite any headers that is included in falco.

main.go Outdated
}

func main() {
var CriticalNamespaces = []string{"kube-system", "kube-public", "kube-node-lease", "falco"}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be nice with ability to append more namespaces from environment variables here?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simongottschlag That would indeed be useful, but remember that we are reviewing code already on master and should be very careful about feature growth. If we introduce env var for this it should be its own PR.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you look in the bottom of the README you will see a conclusion where I talk about the need for this feature.
The reason why I don't add it is because I want the code to have a quick overview.
This code is just for this blog, we had a small discussion about writing a bigger application to be able to manage these kind of loads in the falco slack channel. https://kubernetes.slack.com/archives/CMWH3EH32/p1619869899307600?thread_ts=1619847638.297000&cid=CMWH3EH32

So lets see how that evolves in the futre.

main.go Outdated
Comment on lines 36 to 39
bodyReq := os.Getenv("BODY")
if bodyReq == "" {
panic("Need to get environment variable BODY")
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how big the body can be, but there are limits (even though large) to environment variables. Can this be read from a file instead?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I wonder if encoding is correctly handled through the chain?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the event won't be suuuuper big so we shouldn't hit any issues.
Yeah I have been able to trigger a bug, instead of using uptime in the exec and used:

echo '"please dont kill me"'

The json got screwed up and I was unable to parse to input.

At this time I don't think it's worth making it better. But to make this production ready we should see if we can find another way of sending data to our pod through the event-listener or we should handle a bunch of inputs that can create issues.

main.go Outdated

bodyReq := os.Getenv("BODY")
if bodyReq == "" {
panic("Need to get environment variable BODY")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personal preference: never panic, print error to stderr and os.Exit(1).

I would put most of the code in a func run() error {} and return early on this and then in main do something like:

func main() {
  err := run()
  if err != nil {
    os.Exit(1)
  }
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

panic and os.Exit are different things. Read e.g. here https://stackoverflow.com/a/28473339/205674. For example os.Exit will not run you defers. Cleanup and unwinding may not matter a lot in a simple program like this, but if you want to be conscientious, I think this guy has the right idea https://stackoverflow.com/a/46255965/205674 .

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly, Simon demonstrates proper use of os.Exit in his PR https://github.com/simongottschlag/mqtt-log-stdout-v2/pull/1/files#diff-9edc738f3908a55100ae0f3f1adbb25e9b341052e68eebf7e792440bdf06ec8aR30 . Here, the main function can be trivially judged not to require cleanup, since that happened at a lower level already; thus we can safely os.Exit.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think short is swat in this case but it's a good point for the future.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Go Time (podcast) episode goes into it quite well: https://changelog.com/gotime/165

main.go Outdated Show resolved Hide resolved
tekton/pipeline.yaml Outdated Show resolved Hide resolved
main.go Outdated

if !critical {
log.Printf("Deleting pod %s from namespace %s", podName, namespace)
err := kubeClient.CoreV1().Pods(namespace).Delete(context.Background(), podName, metaV1.DeleteOptions{})

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we force delete? Or maybe make it configurable.

Should the context have a timeout?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the use case is terminating suspicious pods, it may be that there should be some form of ops notification in case a pod resists arrest? (Would be new PR, tho.)

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm about force delete I will skip it to keep it shorter.

About context it's a good idea. I just don't know how the k8s client works in go well enough.
I assume that even if the context timesout and the application stops freezing the k8s API will continue to try to delete the pod just like it would if you wrote a kubectl delete command.
If so It might not be worth adding it since the call have gone through.

My initial thought is to do something like this. Is 5 seconds to long?

	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(time.Second*5))
	defer cancel()
	err := kubeClient.CoreV1().Pods(namespace).Delete(ctx, podName, metaV1.DeleteOptions{})

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, deleting pods is about changing the ideal state and letting Kubernetes try to reconcile that state. I think it makes sense that this process waits a bit to see if the delete is successful, but we do not want to keep trying forever or this will itself be a vector for DoS. Probably, the process should wait a bit, log whether it succeeded or timed out and move on.

main.go Outdated Show resolved Hide resolved
It's a less general name
Copy link

@simongottschlag simongottschlag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments for the readme

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated

This was a rather simple example on how we can use the power of tekton together with Falco to protect us from bad actors that is trying to take over pods in our cluster.

As noted during this post there are allot of potential improvements before this is production ready, for example:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a lot
as an example

Copy link

@simongottschlag simongottschlag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noticed that Golang 1.15 was used to build

Dockerfile Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
Comment on lines +35 to +40
criticalNamespaces := map[string]bool{
"kube-system": true,
"kube-public": true,
"kube-node-lease": true,
"falco": true,
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would actually fail early here if it was a critical namespace and skip passing it to the deletePod() function.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't both you and @bittrance happy on this one :)
He asked me earlier to break out to delete and if statement in a separate function. At least if I understood him correctly.

The question is, should we test to see if we are in k8s first or should we check if we aren't in a protected namespace first?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a process should generally ascertain its ability to function properly before starting any work. Consider a scenario where it is difficult for me to provoke a Falco event that results in a dead pod. In that scenario, if we are taking decisions (i.e. early return on protected namespaces) before we done all the setup, it may take a long time before we discover that we have bad credentials. This is basically the fail-fast principle applied to software design.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was what I meant: #1 (comment)

I was thinking that this could be done earlier, but making sure the credentials work seems like a good idea before. 👍

main.go Show resolved Hide resolved
main.go Outdated Show resolved Hide resolved
main.go Show resolved Hide resolved
if criticalNamespaces[namespace] {
log.Printf("The pod %v won't be deleted due to it's part of the critical ns list: %v ", podName, namespace)
return nil
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My previous review comment may have been confusing, but I think the namespace check should go "around" this method rather than inside it.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay so if it's not present call the function to kill the pod?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, something like that. In main(), I think I would do something like:

if criticalNamespaces[namespace] {
    log.Printf("The pod %v won't be deleted due to it's part of the critical ns list: %v ", podName, namespace)
    return
}
err = deletePod(kubeClient, falcoEvent, criticalNamespaces)
if err != nil {
    log.Fatalf("Unable to delete pod due to err %v", err)
}

return nil
}

log.Printf("Deleting pod %s from namespace %s", podName, namespace)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not yet published my magnus opus on logging statements, but I prefer one "info" level log entry per "request". This method has three. Also, logging should in my opinion be written with the assumption that the code works (you use tests to verify that the code works) which means that the request logging goes after the operation, documenting that it happened. Consider:

log.Printf("Do stuff")
doStuff()

vs

doStuff()
log.Printf("Did stuff")

The doStuff() function is almost guaranteed to be more complex than the logging. Thus, if you see "Did stuff" in your log, you are a lot more sure that stuff actually happened than if you see "Do stuff". The former style is usually a product of a developer trying to find out if code works and should not go into production - presumably you found out that it works or you would not put the code in production.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes "ingress" logging is motivated, but should generally be at debug or trace level.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants