Skip to content

Commit

Permalink
Update readme to add more information (opensearch-project#81)
Browse files Browse the repository at this point in the history
* Create JvmService instance on demand.

Signed-off-by: Alex <[email protected]>

* Move the ml_parameters from XContent to the request parameters to avoid
the conflict with search XContent input.

Signed-off-by: Alex <[email protected]>

* Fix the security risks found by PenTest.
1. unhandled 500 server error.
2. Insecure Deserialization

Signed-off-by: Alex <[email protected]>

* Remove unnecessory '*' from the welcome list of model deserializer.

Signed-off-by: Alex <[email protected]>

* Update readme to add more information.

Signed-off-by: Alex <[email protected]>

* Add developer guide to the document.

Signed-off-by: Alex <[email protected]>

* Add documents for ml-commens.

Signed-off-by: Alex <[email protected]>

* Sync the build scripts

Signed-off-by: Alex <[email protected]>

* Remove the dependencies added to support Mleap.

Signed-off-by: Alex <[email protected]>

Co-authored-by: Alex <[email protected]>
  • Loading branch information
2 people authored and jackiehanyang committed Nov 16, 2021
1 parent 24a4c9c commit adf5dcd
Show file tree
Hide file tree
Showing 4 changed files with 183 additions and 106 deletions.
105 changes: 3 additions & 102 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
@@ -1,103 +1,4 @@
# Contributing Guidelines
## Contributing to this Project

Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
documentation, we greatly value feedback and contributions from our community.

Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
information to effectively respond to your bug report or contribution.


## Reporting Bugs/Feature Requests

We welcome you to use the GitHub issue tracker to report bugs or suggest features.

When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:

* A reproducible test case or series of steps
* The version of our code being used
* Any modifications you've made relevant to the bug
* Anything unusual about your environment or deployment


## Contributing via Pull Requests
Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:

1. You are working against the latest source on the *main* branch.
2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
3. You open an issue to discuss any significant work - we would hate for your time to be wasted.

To send us a pull request, please:

1. Fork the repository.
2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
3. Ensure local tests pass.
4. Commit to your fork using clear commit messages.
5. Send us a pull request, answering any default questions in the pull request interface.
6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.

GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
[creating a pull request](https://help.github.com/articles/creating-a-pull-request/).


## Finding contributions to work on
Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start.


## Code of Conduct
This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
[email protected] with any additional questions or comments.

## Developer Certificate of Origin

OpenSearch is an open source product released under the Apache 2.0 license (see either [the Apache site](https://www.apache.org/licenses/LICENSE-2.0) or the [LICENSE.txt file](./LICENSE.txt)). The Apache 2.0 license allows you to freely use, modify, distribute, and sell your own products that include Apache 2.0 licensed software.

We respect intellectual property rights of others and we want to make sure all incoming contributions are correctly attributed and licensed. A Developer Certificate of Origin (DCO) is a lightweight mechanism to do that.

The DCO is a declaration attached to every contribution made by every developer. In the commit message of the contribution, the developer simply adds a `Signed-off-by` statement and thereby agrees to the DCO, which you can find below or at [DeveloperCertificate.org](http://developercertificate.org/).

```
Developer's Certificate of Origin 1.1
By making a contribution to this project, I certify that:
(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license
indicated in the file; or
(b) The contribution is based upon previous work that, to the
best of my knowledge, is covered under an appropriate open
source license and I have the right under that license to
submit that work with modifications, whether created in whole
or in part by me, under the same open source license (unless
I am permitted to submit under a different license), as
Indicated in the file; or
(c) The contribution was provided directly to me by some other
person who certified (a), (b) or (c) and I have not modified
it.
(d) I understand and agree that this project and the contribution
are public and that a record of the contribution (including
all personal information I submit with it, including my
sign-off) is maintained indefinitely and may be redistributed
consistent with this project or the open source license(s)
involved.
```
We require that every contribution to OpenSearch is signed with a Developer Certificate of Origin. Additionally, please use your real name. We do not accept anonymous contributors nor those utilizing pseudonyms.

Each commit must include a DCO which looks like this

```
Signed-off-by: Jane Smith <[email protected]>
```
You may type this line on your own when writing your commit messages. However, if your user.name and user.email are set in your git configs, you can use `-s` or `--signoff` to add the `Signed-off-by` line to the end of the commit message.

## Security issue notifications
If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue.


## Licensing

See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
OpenSearch is a community project that is built and maintained by people just like **you**.
[This document](https://github.com/opensearch-project/.github/blob/main/CONTRIBUTING.md) explains how you can contribute to this and related projects.
65 changes: 65 additions & 0 deletions DEVELOPER_GUIDE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,65 @@
- [Developer Guide](#developer-guide)
- [Forking and Cloning](#forking-and-cloning)
- [Install Prerequisites](#install-prerequisites)
- [JDK 14](#jdk-14)
- [Setup](#setup)
- [Build](#build)
- [Building from the command line](#building-from-the-command-line)
- [Building from the IDE](#building-from-the-ide)
- [Debugging](#debugging)

## Developer Guide

### Forking and Cloning

Fork this repository on GitHub, and clone locally with `git clone`.

### Install Prerequisites

#### JDK 14

OpenSearch components build using Java 14 at a minimum. This means you must have a JDK 14 installed with the environment variable `JAVA_HOME` referencing the path to Java home for your JDK 14 installation, e.g. `JAVA_HOME=/usr/lib/jvm/jdk-14`.

### Setup

1. Clone the repository (see [Forking and Cloning](#forking-and-cloning))
2. Make sure `JAVA_HOME` is pointing to a Java 14 JDK (see [Install Prerequisites](#install-prerequisites))
3. Launch Intellij IDEA, Choose Import Project.

### Build

This package uses the [Gradle](https://docs.gradle.org/current/userguide/userguide.html) build system. Gradle comes with excellent documentation that should be your first stop when trying to figure out how to operate or modify the build. we also use the OpenSearch build tools for Gradle. These tools are idiosyncratic and don't always follow the conventions and instructions for building regular Java code using Gradle. Not everything in this package will work the way it's described in the Gradle documentation. If you encounter such a situation, the OpenSearch build tools [source code](https://github.com/opensearch-project/OpenSearch/tree/main/buildSrc/src/main/groovy/org/opensearch/gradle) is your best bet for figuring out what's going on.

#### Building from the command line

1. `./gradlew build` builds and tests
2. `./gradlew :run` launches a single node cluster with ml-commons plugin installed
3. `./gradlew :integTest` launches a single node cluster with ml-commons plugin installed and runs all integration tests except security
4. ` ./gradlew :integTest --tests="**.test execute foo"` runs a single integration test class or method
5. `./gradlew spotlessApply` formats code. And/or import formatting rules in `.eclipseformat.xml` with IDE.

When launching a cluster using one of the above commands logs are placed in `/build/cluster/run node0/opensearch-<version>/logs`. Though the logs are teed to the console, in practices it's best to check the actual log file.

#### Building from the IDE

Currently, the only IDE we support is IntelliJ IDEA. It's free, it's open source, it works. The gradle tasks above can also be launched from IntelliJ's Gradle toolbar and the extra parameters can be passed in via the Launch Configurations VM arguments.

#### Debugging

Sometimes it's useful to attach a debugger to either the OpenSearch cluster or the integ tests to see what's going on. When running unit tests you can just hit 'Debug' from the IDE's gutter to debug the tests. To debug code running in an actual server run:

```
./gradlew :integTest --debug-jvm # to start a cluster and run integ tests
OR
./gradlew :run --debug-jvm # to just start a cluster that can be debugged
```

The OpenSearch server JVM will launch suspended and wait for a debugger to attach to `localhost:8000` before starting the OpenSearch server.

To debug code running in an integ test (which exercises the server from a separate JVM) run:

```
./gradlew -Dtest.debug :integTest
```

The test runner JVM will start suspended and wait for a debugger to attach to `localhost:5005` before running the tests.
27 changes: 27 additions & 0 deletions LICENSE
Original file line number Diff line number Diff line change
Expand Up @@ -173,3 +173,30 @@
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.

END OF TERMS AND CONDITIONS

APPENDIX: How to apply the Apache License to your work.

To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright [yyyy] [name of copyright owner]

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
92 changes: 88 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,12 +1,96 @@
## OpenSearch Machine Learning
<img src="https://opensearch.org/assets/brand/SVG/Logo/opensearch_logo_default.svg" height="64px"/>

Machine Learning Framework for OpenSearch is a new solution that make it easy to develop new machine learning feature. It allows engineers to leverage existing opensource machine learning algorithms and reduce the efforts to build any new machine learning feature. It also removes the necessity from engineers to manage the machine learning tasks which will help to speed the feature developing process.
<!-- TOC -->

- [OpenSearch Machine Learning Commons](#opensearch-machine-learning-commons)
- [Contributing](#contributing)
- [Code of Conduct](#code-of-conduct)
- [Security](#security)
- [License](#license)
- [Copyright](#copyright)

<!-- /TOC -->

## OpenSearch Machine Learning Commons

Machine Learning Commons for OpenSearch is a new solution that make it easy to develop new machine learning feature. It allows engineers to leverage existing opensource machine learning algorithms and reduce the efforts to build any new machine learning feature. It also removes the necessity from engineers to manage the machine learning tasks which will help to speed the feature developing process.

### Problem Statement

Until today, the challenge is significant to build a new machine learning feature inside OpenSearch. The reasons include:

* **Disruption to OpenSearch Core features**. Machine learning is very computationally intensive. But currently there is no way to add dedicated computation resources in OpenSearch for machine learning jobs, hence these jobs have to share same resources with Core features, such as: indexing and searching. That might cause the latency increasing on search request, and cause circuit breaker exception on memory usage. To address this, we have to carefully distribute models and limit the data size to run the AD job. When more and more ML features are added into OpenSearch, it will become much harder to manage.
* **Lack of support for machine learning algorithms.** Customers need more algorighms within Opensearch, otherwise the data need be exported to outside of elasticsearch, such as s3 first to do the job, which will bring extra cost and latency.
* **Lack of resource management mechanism between multiple machine learning jobs.** It's hard to coordinate the resources between multi features.


In the meanwhile, we observe more and more machine learning features required to be supported in OpenSearch to power end users’ business needs. For instance:

* **Forecasting**: Forecasting is very popular in time series data analysis. Although the past data isn’t always an indicator for the future, it’s still very powerful tool used in some use cases, such as capacity planning to scale up/down the service hosts in IT operation.
* **Root Cause Analysis in DevOps**: Today some customers use OpenSearch for IT operations. It becomes more and more complicated to identify the root cause of an outage or incident since it needs to gather all the information in the ecosystem, such as log, traces, metrics. Machine learning technique is a great fit to address this issue by building topology models of the system automatically, and understanding the similarity and casual relations between events, etc.
* **Machine Learning in SIEM**: SIEM(Security Information and Event Management) is another domain in OpenSearch. Machine learning is also very useful in SIEM to help facilitate security analytics, and it can reduce the effort on sophisticated tasks, enable real time threat analysis and uncover anomalies.

### Solution
The solution is to introduce a new Machine Learning library inside the OpenSearch cluster. The major functionalities in this solution include:

* **Unified Client Interfaces:** clients can use common interfaces for training and inference tasks, and then follow the algorithm interface to give right input parameters, such as input data, hyperparameters. A client library will be built for easy use.
* **ML Plugin:** ML plugin will help to initiate the ML nodes, and choose the right nodes and allocate the resources for each request, and manage machine learning tasks with monitoring and failure handing supports, and store the model results; it will be the bridge for the communication between OpenSearch process and ML engine.
* **ML Engine**: This engine will be the host for ML algorithms. Java based machine learning algorithms will be supported in the first release.

This solution makes it easy to develop new machine learning features. It allows engineers to leverage existing open-source machine learning algorithms, and reduce the efforts to build any new machine learning feature. It also removes the necessity from engineers to manage the machine learning tasks which will help to speed up the feature developing process.

### How to use it for new feature development

As mentioned above, new interfaces including both prediction and training will be provided to customers through Rest APIs, and to other plugins through transport action. Here are the transport action for prediction and training interfaces.

* Predict Transport Action for prediction job request
```
Request: {
"algorithm": "ARIMA", //the name of algorithm
"parameters": {"forecasts_en":10, "seasonal"=true}, // parameters of the algorithm, can be null or empty
"modelId":123, //the id for trainded model.
"inputData": [[1.0, 2, 3.1, true, "v1"],[1.1, 4, 5.2, false, "v2"]] // internal data frame interface
}
Response: {
"taskId": "123", //the id of the job request
"status": "SUCCESS", // the job execution status
"predictionResult": [[6.0],[7.0]] // internal data frame interface
}
```
* Training Transport Action to start training job request - Async Interface
```
Request: {
"algorithm": "ARIMA", //the name of algorithm
"parameters": {"forecasts_en":10, "seasonal"=true}, // parameters of the algorithm, can be null or empty
"inputData": [[1.0, 2, 3.1, true, "v1"],[1.1, 4, 5.2, false, "v2"]] // internal data frame interface
}
Response: {
"taskId": "123", //the id of the job request
"status": "IN_PROGRESS" // the job execution status
}
```

## Contributing

See [developer guide](DEVELOPER_GUIDE.md) and [how to contribute to this project](CONTRIBUTING.md).

## Code of Conduct

This project has adopted the [Amazon Open Source Code of Conduct](CODE_OF_CONDUCT.md). For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq), or contact [[email protected]](mailto:[email protected]) with any additional questions or comments.

## Security

See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public GitHub issue.

## License

See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
This project is licensed under the [Apache v2.0 License](LICENSE).

## Copyright

Copyright 2020-2021 Amazon.com, Inc. or its affiliates. All Rights Reserved.

0 comments on commit adf5dcd

Please sign in to comment.