Skip to content

Commit

Permalink
Merge potential next release v0.4 (#187) Breaking Changes
Browse files Browse the repository at this point in the history
* First draft of the new n-dimensional arrays + NB use case
* Improves default implementation of multiple Array methods
* Refactors tree methods
* Adds matrix decomposition routines
* Adds matrix decomposition methods to ndarray and nalgebra bindings
* Refactoring + linear regression now uses array2
* Ridge & Linear regression
* LBFGS optimizer & logistic regression
* LBFGS optimizer & logistic regression
* Changes linear methods, metrics and model selection methods to new n-dimensional arrays
* Switches KNN and clustering algorithms to new n-d array layer
* Refactors distance metrics
* Optimizes knn and clustering methods
* Refactors metrics module
* Switches decomposition methods to n-dimensional arrays
* Linalg refactoring - cleanup rng merge (#172)
* Remove legacy DenseMatrix and BaseMatrix implementation. Port the new Number, FloatNumber and Array implementation into module structure.
* Exclude AUC metrics. Needs reimplementation
* Improve developers walkthrough

New traits system in place at `src/numbers` and `src/linalg`
Co-authored-by: Lorenzo <[email protected]>

* Provide SupervisedEstimator with a constructor to avoid explicit dynamical box allocation in 'cross_validate' and 'cross_validate_predict' as required by the use of 'dyn' as per Rust 2021
* Implement getters to use as_ref() in src/neighbors
* Implement getters to use as_ref() in src/naive_bayes
* Implement getters to use as_ref() in src/linear
* Add Clone to src/naive_bayes
* Change signature for cross_validate and other model_selection functions to abide to use of dyn in Rust 2021
* Implement ndarray-bindings. Remove FloatNumber from implementations
* Drop nalgebra-bindings support (as decided in conf-call to go for ndarray)
* Remove benches. Benches will have their own repo at smartcore-benches
* Implement SVC
* Implement SVC serialization. Move search parameters in dedicated module
* Implement SVR. Definitely too slow
* Fix compilation issues for wasm (#202)

Co-authored-by: Luis Moreno <[email protected]>
* Fix tests (#203)

* Port linalg/traits/stats.rs
* Improve methods naming
* Improve Display for DenseMatrix

Co-authored-by: Montana Low <[email protected]>
Co-authored-by: VolodymyrOrlov <[email protected]>
  • Loading branch information
3 people authored Oct 31, 2022
1 parent bb71656 commit 52eb6ce
Show file tree
Hide file tree
Showing 110 changed files with 10,390 additions and 9,170 deletions.
12 changes: 11 additions & 1 deletion .github/CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,9 +5,17 @@ email, or any other method with the owners of this repository before making a ch

Please note we have a [code of conduct](CODE_OF_CONDUCT.md), please follow it in all your interactions with the project.

## Background

We try to follow these principles:
* follow as much as possible the sklearn API to give a frictionless user experience for practitioners already familiar with it
* use only pure-Rust implementations for safety and future-proofing (with some low-level limited exceptions)
* do not use macros in the library code to allow readability and transparent behavior
* priority is not on "big data" dataset, try to be fast for small/average dataset with limited memory footprint.

## Pull Request Process

1. Open a PR following the template.
1. Open a PR following the template (erase the part of the template you don't need).
2. Update the CHANGELOG.md with details of changes to the interface if they are breaking changes, this includes new environment variables, exposed ports useful file locations and container parameters.
3. Pull Request can be merged in once you have the sign-off of one other developer, or if you do not have permission to do that you may request the reviewer to merge it for you.

Expand All @@ -16,6 +24,7 @@ Take a look to the conventions established by existing code:
* Every module should come with some reference to scientific literature that allows relating the code to research. Use the `//!` comments at the top of the module to tell readers about the basics of the procedure you are implementing.
* Every module should provide a Rust doctest, a brief test embedded with the documentation that explains how to use the procedure implemented.
* Every module should provide comprehensive tests at the end, in its `mod tests {}` sub-module. These tests can be flagged or not with configuration flags to allow WebAssembly target.
* Run `cargo doc --no-deps --open` and read the generated documentation in the browser to be sure that your changes reflects in the documentation and new code is documented.

## Issue Report Process

Expand All @@ -29,6 +38,7 @@ Take a look to the conventions established by existing code:
1. After a PR is opened maintainers are notified
2. Probably changes will be required to comply with the workflow, these commands are run automatically and all tests shall pass:
* **Coverage** (optional): `tarpaulin` is used with command `cargo tarpaulin --out Lcov --all-features -- --test-threads 1`
* **Formatting**: run `rustfmt src/*.rs` to apply automatic formatting
* **Linting**: `clippy` is used with command `cargo clippy --all-features -- -Drust-2018-idioms -Dwarnings`
* **Testing**: multiple test pipelines are run for different targets
3. When everything is OK, code is merged.
Expand Down
10 changes: 10 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -17,3 +17,13 @@ smartcore.code-workspace

# OS
.DS_Store


flamegraph.svg
perf.data
perf.data.old
src.dot
out.svg

FlameGraph/
out.stacks
6 changes: 6 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## Added
- Seeds to multiple algorithims that depend on random number generation.
- Added feature `js` to use WASM in browser
- Drop `nalgebra-bindings` feature
- Complete refactoring with *extensive API changes* that includes:
* moving to a new traits system, less structs more traits
* adapting all the modules to the new traits system
* moving towards Rust 2021, in particular the use of `dyn` and `as_ref`
* reorganization of the code base, trying to eliminate duplicates

## BREAKING CHANGE
- Added a new parameter to `train_test_split` to define the seed.
Expand Down
41 changes: 20 additions & 21 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -2,9 +2,9 @@
name = "smartcore"
description = "The most advanced machine learning library in rust."
homepage = "https://smartcorelib.org"
version = "0.2.1"
version = "0.4.0"
authors = ["SmartCore Developers"]
edition = "2018"
edition = "2021"
license = "Apache-2.0"
documentation = "https://docs.rs/smartcore"
repository = "https://github.com/smartcorelib/smartcore"
Expand All @@ -13,48 +13,47 @@ keywords = ["machine-learning", "statistical", "ai", "optimization", "linear-alg
categories = ["science"]

[features]
default = ["datasets"]
default = ["datasets", "serde"]
ndarray-bindings = ["ndarray"]
nalgebra-bindings = ["nalgebra"]
datasets = ["rand_distr", "std"]
fp_bench = ["itertools"]
std = ["rand/std", "rand/std_rng"]
# wasm32 only
js = ["getrandom/js"]

[dependencies]
approx = "0.5.1"
cfg-if = "1.0.0"
ndarray = { version = "0.15", optional = true }
nalgebra = { version = "0.31", optional = true }
num-traits = "0.2"
num-traits = "0.2.12"
num = "0.4"
rand = { version = "0.8", default-features = false, features = ["small_rng"] }
rand_distr = { version = "0.4", optional = true }
serde = { version = "1", features = ["derive"], optional = true }
itertools = { version = "0.10.3", optional = true }
cfg-if = "1.0.0"

[target.'cfg(target_arch = "wasm32")'.dependencies]
getrandom = { version = "0.2", optional = true }

[dev-dependencies]
smartcore = { path = ".", features = ["fp_bench"] }
criterion = { version = "0.4", default-features = false }
serde_json = "1.0"
bincode = "1.3.1"

[target.'cfg(target_arch = "wasm32")'.dev-dependencies]
wasm-bindgen-test = "0.3"

[[bench]]
name = "distance"
harness = false
[profile.bench]
debug = true

resolver = "2"

[[bench]]
name = "naive_bayes"
harness = false
required-features = ["ndarray-bindings", "nalgebra-bindings"]
[profile.test]
debug = 1
opt-level = 3
split-debuginfo = "unpacked"

[[bench]]
name = "fastpair"
harness = false
required-features = ["fp_bench"]
[profile.release]
strip = true
debug = 1
lto = true
codegen-units = 1
overflow-checks = true
45 changes: 41 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,11 +17,48 @@

-----

## Developers
Contributions welcome, please start from [CONTRIBUTING and other relevant files](.github/CONTRIBUTING.md).

## Current status
* Current working branch is `development` (if you want something that you can test right away).
* Breaking changes are undergoing development at [`v0.5-wip`](https://github.com/smartcorelib/smartcore/tree/v0.5-wip#readme) (if you are a newcomer better to start from [this README](https://github.com/smartcorelib/smartcore/tree/v0.5-wip#readme) as this will be the next major release).

To start getting familiar with the new Smartcore v0.5 API, there is now available a [**Jupyter Notebook environment repository**](https://github.com/smartcorelib/smartcore-jupyter).
To start getting familiar with the new Smartcore v0.5 API, there is now available a [**Jupyter Notebook environment repository**](https://github.com/smartcorelib/smartcore-jupyter). Please see instructions there, your feedback is valuable for the future of the library.

## Developers
Contributions welcome, please start from [CONTRIBUTING and other relevant files](.github/CONTRIBUTING.md).

### Walkthrough: traits system and basic structures

#### numbers
The library is founded on basic traits provided by `num-traits`. Basic traits are in `src/numbers`. These traits are used to define all the procedures in the library to make everything safer and provide constraints to what implementations can handle.

#### linalg
`numbers` are made at use in linear algebra structures in the **`src/linalg/basic`** module. These sub-modules define the traits used all over the code base.

* *arrays*: In particular data structures like `Array`, `Array1` (1-dimensional), `Array2` (matrix, 2-D); plus their "views" traits. Views are used to provide no-footprint access to data, they have composed traits to allow writing (mutable traits: `MutArray`, `ArrayViewMut`, ...).
* *matrix*: This provides the main entrypoint to matrices operations and currently the only structure provided in the shape of `struct DenseMatrix`. A matrix can be instantiated and automatically make available all the traits in "arrays" (sparse matrices implementation will be provided).
* *vector*: Convenience traits are implemented for `std::Vec` to allow extensive reuse.

These are all traits and by definition they do not allow instantiation. For instantiable structures see implementation like `DenseMatrix` with relative constructor.

#### linalg/traits
The traits in `src/linalg/traits` are closely linked to Linear Algebra's theoretical framework. These traits are used to specify characteristics and constraints for types accepted by various algorithms. For example these allow to define if a matrix is `QRDecomposable` and/or `SVDDecomposable`. See docstring for referencese to theoretical framework.

As above these are all traits and by definition they do not allow instantiation. They are mostly used to provide constraints for implementations. For example, the implementation for Linear Regression requires the input data `X` to be in `smartcore`'s trait system `Array2<FloatNumber> + QRDecomposable<TX> + SVDDecomposable<TX>`, a 2-D matrix that is both QR and SVD decomposable; that is what the provided strucure `linalg::arrays::matrix::DenseMatrix` happens to be: `impl<T: FloatNumber> QRDecomposable<T> for DenseMatrix<T> {};impl<T: FloatNumber> SVDDecomposable<T> for DenseMatrix<T> {}`.

#### metrics
Implementations for metrics (classification, regression, cluster, ...) and distance measure (Euclidean, Hamming, Manhattan, ...). For example: `Accuracy`, `F1`, `AUC`, `Precision`, `R2`. As everything else in the code base, these implementations reuse `numbers` and `linalg` traits and structures.

These are collected in structures like `pub struct ClassificationMetrics<T> {}` that implements `metrics::Metrics`, these are groups of functions (classification, regression, cluster, ...) that provide instantiation for the structures. Each of those instantiation can be passed around using the relative function, like `pub fn accuracy<T: Number + RealNumber + FloatNumber, V: ArrayView1<T>>(y_true: &V, y_pred: &V) -> T`. This provides a mechanism for metrics to be passed to higher interfaces like the `cross_validate`:
```rust
let results =
cross_validate(
BiasedEstimator::fit, // custom estimator
&x, &y, // input data
NoParameters {}, // extra parameters
cv, // type of cross validator
&accuracy // **metrics function** <--------
).unwrap();
```


TODO: complete for all modules
18 changes: 0 additions & 18 deletions benches/distance.rs

This file was deleted.

56 changes: 0 additions & 56 deletions benches/fastpair.rs

This file was deleted.

73 changes: 0 additions & 73 deletions benches/naive_bayes.rs

This file was deleted.

Loading

0 comments on commit 52eb6ce

Please sign in to comment.