Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

initial changes for invoker cluster backpressure #3

Open
wants to merge 62 commits into
base: master
Choose a base branch
from

Conversation

tysonnorris
Copy link

Description

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

Copy link

@chetanmeh chetanmeh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would really address the dynamic capacity addition aspect! Do we have the change in MesosContainerFactory somewhere which trigger the NodeStatsUpdateEvent? That would help to see complete picture once

It may be better to extract out a ResourcePolicy abstraction to keep the policy aspect separate from the scheduling purpose

//if cluster managed resources, subscribe to events
if (poolConfig.clusterManagedResources) {
logging.info(this, "subscribing to NodeStats updates")
Events.subscribe(self, NodeStatsUpdateEvent)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it unsubscribe itself? Given its kind of singleton its fine but may be for testcases we should unsubscribe

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like EventStream will handle unsubscribe on actor termination per docs, but will change this if it is a problem for tests.

* Publishes the payload of the MsgEnvelope when the topic of the
* MsgEnvelope equals the OWEvent specified when subscribing.
*/
object Events extends EventBus with LookupClassification {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can possibly use the EventStream of the ActorSystem itself. Should meet our requirement

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! will look into making this change

@tysonnorris
Copy link
Author

I'm interested to know what you have in mind for ResourcePolicy abstraction - it may help once I get the mesos PR setup, will do that today.

@codecov-io
Copy link

codecov-io commented Mar 1, 2019

Codecov Report

Merging #3 into master will decrease coverage by 36.62%.
The diff coverage is 35.21%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master       #3       +/-   ##
===========================================
- Coverage   79.42%   42.79%   -36.63%     
===========================================
  Files         170      173        +3     
  Lines        7940     8256      +316     
  Branches      532      582       +50     
===========================================
- Hits         6306     3533     -2773     
- Misses       1634     4723     +3089
Impacted Files Coverage Δ
...pache/openwhisk/core/containerpool/Container.scala 69.62% <ø> (-13.93%) ⬇️
...cala/org/apache/openwhisk/core/yarn/YARNTask.scala 0% <ø> (-70%) ⬇️
...sk/core/containerpool/docker/DockerContainer.scala 78.04% <ø> (-9.76%) ⬇️
...penwhisk/core/containerpool/ContainerFactory.scala 85.71% <ø> (ø) ⬆️
...la/org/apache/openwhisk/core/mesos/MesosTask.scala 0% <ø> (-73.81%) ⬇️
...inerpool/AkkaClusterContainerResourceManager.scala 0% <0%> (ø)
...in/scala/org/apache/openwhisk/common/Logging.scala 58.88% <100%> (-10.87%) ⬇️
.../containerpool/LocalContainerResourceManager.scala 100% <100%> (ø)
.../core/containerpool/ContainerResourceManager.scala 100% <100%> (ø)
.../scala/org/apache/openwhisk/core/entity/Size.scala 59.7% <16.66%> (-29.78%) ⬇️
... and 133 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7a304c1...a2a1844. Read the comment docs.

@chetanmeh
Copy link

Was thinking of trait which abstracts the hasPoolSpaceFor method (like here). However current logic closely match the reservation and I am getting bit confused between its role and free and busy pool.

However idea would be to move your current changes to such a policy and then use that in ContainerPool decision logic. One aspect which makes it tricky is that policy need to be refreshed via event. So we may need to use some volatile state which gets updated after post processing of cluster state changes and then consumer by the hasPoolSpaceFor logic

@tysonnorris tysonnorris changed the base branch from ow-invoker-backpressure to master March 5, 2019 23:47
chetanmeh and others added 22 commits May 13, 2019 13:54
…e#4475)

* Additional cleanup and simplifications for the ActionLoop docs

* Additional cleanup and simplifications for the ActionLoop docs
* Updating runtimes to include new Node.js v12 image.

- Updated runtimes manifest
- Added API documentation
- Minor updates to docs
- Added automated test case

* Fixing accidental default flag for v12
* Update docker version in the controller and guides
* Update travis configuration to install docker=18.06.3
* Separate docker installation from the travis setup script
* Add docker setup script
* Add CoordinatedShutdown to cleanup runtime containers
* add a configuration for root runc dir
* Disable runc use in Jenkins environment
* Add comments which explain the correlation among the version docker and runc and the type of user
* Reenable Docker remote API again
…ame (apache#4488)

Add `pip install docker` to Dockerfile for ow-utils to fix problem pulling docker images
* Switch to consistent indexing policy
* Remove reference to Range index
* Tweak indexing policy comparison to only check for included and excluded path
Do not check for Index type as now there is only one which is Range for our cases
* Use implicit logger
* Excluding root path should be using `/*` instead of `/`
Updates to Alpakka S3 Connector v1.0.1 release.

This commit also includes some fixes related to apache#4484 which were causing compatibility issues with existing setup. For existing setup indexes were still using `Hash` indexing and that caused failure while creating `IndexingPolicy` upon collection read. So added back support for `Hash` index but not using them to create new indexes now
Record collection usage stats for CosmosDB so as to enable tracking the growth of collection in  terms of storage size, document count and index size over the period of time. It also enables tracking any indexing progress if any change is done in Index configuration.

Note that Count stats are currently not exposed via Azure Portal

Further this commit also enables emitting verbose trace for query when in debug mode. This would simplify any query performance analysis.

Fixes apache#4489
`lines` method is now defined as part of java.lang.String in JDK 11. So need to use `linesIterator` for right method to be picked
Adds a configurable MetricsReporter to route Kafka metrics to Kamon once enabled. Set of metrics names which need to be captured needs to be explicitly configured
* Update to restassured v4.0 which is compatible with jdk11

* Need to initialize both truststore and keystore for ssl cert to be validated
* Add explicit return type on a few implicit values.

Implicit values without an explicit return types are not guaranteed to work. They may or may not compile, depending on compilation order and they are going to be disallowed in future versions of Scala. This confuses IntelliJ as well.

* Make Exec public since it's being used from tests.

The call to Exec.isBinaryCode in ActionContainer.scala is from a different package and normally not visible at the call site. Due to scala/bug#11554 this may compile sometimes.
* simplify throttle code

* revert to original throttle algorithm

* Make the waiting time to calm down the thottle explicit
…rceManager impls; handle ClusterResourceError in case of reaching cluster capacity
…dle containers; only remove idle containers that match the actual host/port of selected container
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.