Releases: modin-project/modin
Modin 0.8.2
Modin 0.8.2 release notes

The Modin 0.8.2 release contains a significant amount of code cleanup and bugfixes. The release contains a total of 61 commits closing 59 issues. The highlights of this release are listed below. For the full release notes, please run `git log --pretty=oneline 0.8.1.1...0.8.2`.

Highlighted commits
-------------------
* FIX-#2369: Update pandas version to 1.1.4 (#2371)
* FIX-#2365: Fix `Series.value_counts` when `dropna=False` (#2366)
* FEAT-#1844: upgrade pyarrow to 1.0 (#2347)
* FEAT-#2271: Add implementation of `groupby.shift` (#2323)
* DOCS-#2334: Add tutorials to main repo (#2335)
* FIX-#2311: fixed performance bottleneck at reduction operations (#2314)
* FIX-#2133 #2265: Fix binary operations for modin frames in case when partitioning isn't aligned (#2256)
* FEAT-#2303: fix OmniSci aggregates and add mean (#2304)
* FEAT-#2299: support value_counts in OmniSci backend. (#2300)
* FEAT-#2282: support DataFrame.[count|max|min|sum] for OmniSci backend (#2283)
* FIX-#1988: Fix indexing over Series via `loc` (#2262)
* FIX-#1965: Fix `count` func in case `numeric_only`==True (#2228)

Contributors this release
-------------------------
The following users contributed code to Modin since the last release.
@kvu35 (First Time contributor) ⭐️
@ienkovich
@prutskov
@amyskov
@vnlitvinov
@dchigarev
@YarShev
@anmyachev
@gshimansky
@devin-petersohn
Modin 0.8.1.1
Modin 0.8.1.1 release notes

Dependencies
------------
* FIX-#2113: Ray 1.0 compatibility (#2114)
Modin 0.8.1
Modin 0.8.1 release notes

The Modin 0.8.1 release contains a large amount of new functionality and bugfixes. Additionally, a large amount of effort this release was spent improving the code quality and testing infrastructure for Modin developers. This is the first release that can be used with OmniSci as a compute backend (experimentally).

Bugfixes + Pandas Concordance (🐛 + 🐼)
----------------------------------------
* FIX-#1647: Support repr() on empty Series. (#1859)
* Fix recursion in experimental mode in some cases (#1874)
* FIX-#1674: Series.apply and DataFrame.apply (#1718)
* FIX-#1869: index sort for count(level=...) (#1870)
* FIX-#1497: Don't sort in concat() when sort=False (#1889)
* FIX-#1854: groupby() with arbitrary series (#1886)
* FIX-#1959 #1987: Fix `duplicated` and `drop_duplicates` functions (#1994)
* FEAT-#1285: Add `sem` implementation for `Series` and `DataFrame` (#2048)
* FIX-#2054: Moved non-dependent on modin.DataFrame utils to modin/utils.py (#2055)
* FIX-#2052: fix spawning of remote cluster (#2053)
* FIX-#1918: fix core dumped issue (#2000)
* FIX-#1386: Fix `read_csv` for incorrect csv data (#2076)
* FIX-#1997: Fix `unstack` for MultiIndex with different inner lvl-nodes (#2012)
* FIX-#2080: engine dispatching moved to a separate folder (#2081)
* FEAT-#1957: abstract methods in BaseQueryCompiler replaced to defaults (#2047)
* FIX-#1997 #2084: Fix unstack for case when columns have (#2086)
* FIX-#2069: Add workaround for Python issubclass() quirk (#2070)
* REFACTOR-#2101: avoid unconditional index access in DataFrame.rename (#2102)
* FIX-#2110: get rid of 'NotImplementedError' at OmniSci query compiler (#2112)
* FIX-#1900: Fix bug in groupby when index name is passed by string (#2125)
* BUG-#2127: fix delimiter param for pyarrow based read_csv (#2129)
* FIX-#2145: add cloud dependencies to conda dev environment (#2143)
* FIX-#2148: add note about braceexpand for cloud examples (#2149)
* FIX-#2151: add add_conda_packages for remote omnisci (#2152)
* FIX-#2147: return interval for python micro version - *.*.X (#2146)
* FIX-#2134: Fix mismatch partitioning insertions with same index (#2140)
* FEAT-#2154: generate MultiIndex for columns in groupby.agg (#2155)
* FIX-#1921: Fix `read_excel` when sheet names are non-default (#2159)
* FIX-#2156: improve index name mangling (#2158)
* FIX-#2172: support float32 in calcite serializer (#2173)

New Functionality ✨
--------------------
* FEAT-#1881: add scale-out feature dependencies (#1892)
* FEAT-#2058: Improve how remote factories are defined (#2060)
* FEAT-#1871: introduce OmniSci based experimental engine (#2079)
* FEAT-#2108: Save rpyc server output if rpyc logging is on (#2109)
* FEAT-#2085: sync python version between both contexts (#2107)
* FEAT-#1991: Enable OmniSci on cloud (#2119)
* FIX-#1144: Fix `read_parquet` for working with HDFS (#2120)
* FEAT-#1992: enable ETL part of LoanPD bench in cloud (#2106)
* FEAT-#2089: add ability to install additional conda packages (#2117)
* FEAT-#1200: pivot_table implementation (#1669)
* FEAT-#2141: support skew aggregate in omnisci backend (#2142)
* FEAT-#1219, FEAT-#2135: Add `corr` and `cov` (#2130)
* FEAT-#1847: High performance, no shuffle train_test_split (#1848)
* FEAT-#2138: sync modin version between local and remote contexts (#2153)

Code Quality + Testing 💯
-------------------------
* TEST-#1876: Add tests running under experimental (#1877)
* FIX-#1867: establish CI (#1868)
* FIX-#1887: fix versions (#1888)
* TEST-#1865: Add RPyC library in requirements (#1866)
* TEST-2022: speed up prepare-cache job (#2023)
* TEST-#2024: remove test_dataframe.py (#2025)
* TEST-#2020: decrease parallel tests on Ubuntu (#2021)
* TEST-#2030: speed up cache; decrease parallel jobs in push.yml (#2031)
* TEST-#2028: speed up window tests (#2029)
* TEST-#2026: speed up test_join_sort.py (#2027)
* TEST-#2037: speed up test_binary with refactor dataframe.py (#2038)
* TEST-#2044: speed up iter tests (#2045)
* TEST-#2042: speed up udf tests (#2043)
* TEST-#2033: speed up test_series.py (#2034)
* TEST-#2039: speed up default tests (#2040)
* TEST-#2050: decrease number of parallel jobs on windows Ci (#2051)
* TEST-#1891: use conda instead of pip (#2056)
* REFACTOR-#2035: move getitem_array to the backend (#2036)
* REFACTOR-#2083: Rename LISCENSE_HEADER to LICENSE_HEADER. (#2082)
* REFACTOR-#1839: Update pandas dependency and pandas APIs to match (#1840)
* Simulate cluster for testing remote context (#1982)
* Fix test_from_csv for simulated remote case (#2111)
* TEST-#2123: testing of OmniSci added at CI (#2124)
* FEAT-#2087: Added benchmarks test suite (#2103)
* FEAT-#2136: Added benchmarks for mask generation and indexing (#2137)
* FIX-#2162: exclude test folders from coverage report (#2160)
* REFACTOR-#2170: simplify concat of a single frame (#2171)
* FEAT-#2166: Added benchmarks for DataFrame.merge (#2167)

Backend enhancements + Performance 🚀
-------------------------------------
* FEAT-#1861: Use cloudpickle library for experimental.cloud features (#1862)
* Fix access to special attributes in experimental mode (#1875)
* REFACTOR-#2011: move default_to_pandas in groupby to backend (#2041)
* FIX-#2115: Use `seek` when we don't need to check quotes for CSV (#2116)

Documentation 📃
----------------
* Conda recipe for Modin (#1986)
* FIX-#2131: Add note on `value_counts` for `DataFrame` in the doc (#2132)
* DOC-REFACTOR: 1/n Refactoring the documentation (#2095)
* DOCS-#2068: Provide Jupyter notebooks showing running NYC Taxi (#2168)
* DOCS-#2176: Add plantuml and issues to doc dependencies (#2177)

Dependencies
------------
* FIX-#911: Pin Dask Dependency for Python 3.8 compatibility (#1846)
* FIX-#2090: Do not always require pyarrow (#2126)

Contributors this release
-------------------------
The following users contributed code to Modin since the last release.
@abykovsk (First Time contributor) ⭐️
@anton-malakhov (First Time contributor) ⭐️
@heuermh (First Time contributor) ⭐️
@ienkovich (First Time contributor) ⭐️
@itamarst (Returning contributor) 🌟
@prutskov (Returning contributor) 🌟
@amyskov (Returning contributor) 🌟
@vnlitvinov (Returning contributor) 🌟
@dchigarev (Returning contributor) 🌟
@YarShev (Returning contributor) 🌟
@anmyachev (Returning contributor) 🌟
@gshimansky (Returning contributor) 🌟
@devin-petersohn (Maintainer)
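As an illustration of the `pivot_table` implementation listed under New Functionality for 0.8.1 (FEAT-#1200), here is a minimal sketch. The data and column names are made up for the example, and the fallback to plain pandas is an assumption added only so the snippet runs where Modin is not installed; it is not part of the release.

```python
try:
    import modin.pandas as pd  # Modin, if it is installed
except ImportError:
    import pandas as pd  # assumption: fall back to pandas so the sketch still runs

# Hypothetical sales data, invented for this example.
df = pd.DataFrame(
    {
        "region": ["east", "east", "west", "west"],
        "product": ["x", "y", "x", "y"],
        "units": [1, 2, 3, 4],
    }
)

# One row per region, one column per product, summing units in each cell.
table = df.pivot_table(index="region", columns="product", values="units", aggfunc="sum")

units_x = table["x"].tolist()
units_y = table["y"].tolist()
print(units_x, units_y)
```

The call has the same shape as in pandas, which is the point of the feature: existing pandas code using `pivot_table` should not need changes beyond the import.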
Modin 0.8.0
Modin 0.8.0 release notes

The Modin 0.8.0 release is one of the biggest releases yet, and includes several bugfixes and new functionality, highlighted below. One of the new key features is the ability to spawn and run Modin code on a cluster via a new experimental cloud API. This API allows you to switch between running on your laptop and running in the cloud, across multiple clusters. The API is as simple as:

```
import modin.pandas as pd
from modin.experimental.cloud import cluster

example_cluster = cluster.create("aws", "aws_credentials")

with example_cluster:
    remote_df = pd.DataFrame([1, 2, 3, 4])
    print(len(remote_df))  # len() is executed remotely

local_df = pd.read_csv("my.csv")
print(len(local_df))
```

With this simple API, data scientists have more power at their fingertips. The high level overview of the major bugfixes and new functionality can be found below.

Bugfixes + Pandas Concordance (🐛 + 🐼)
----------------------------------------
* Level parameter for kurt function implementation (#1567)
* Fix of issue #1462: groupby_agg ignores exceptions (#1703)
* Fix AttributeError: module 'numpy.random' has no attribute 'randomState' (#1707)
* Correctly handle mismatched quotes and csv.QUOTE_NONE flag. (#1555)
* Fix for Series.attrs and Series.array (#1717)
* Fix #1683 - losing index names in pd.concat (#1684)
* Use low-level api for kurt function implementation with defined level parameter (#1719)
* Fix of inconsistent indices (#1727)
* Fix support for callable in loc/iloc (#1776)
* Fix support for nested assignment with `loc`/`iloc` (#1788)
* Fix support for `loc` with MultiIndex parameter (#1789)
* Fix metadata for concat and mask when `axis=1` (#1797)
* Fix unlimited column printing for smaller dataframes (#1799)
* Fix visual bug with repr on smaller dataframes (#1798)
* Fix support for cummax and cummin across int and float (#1800)
* Fix support for dictionary in `pd.concat` (#1795)
* Series.reset_index considering 'name' fix (#1820)
* `to_pandas` of nested objects added (#1828)
* Don't sort indexes in Series functions with level parameter (#1830)
* Fix result of `Series.dt.components/freq/tz` (#1730)
* Groupby on categories fixed (#1802)
* product/sum incorrect behavior of 'min_count' fixed (#1827)
* Support for groupby() with original Series in by list. (#1842)
* make 'sort_index' consider axis parameter (#1858)
* properly process UDFs (#1845)

New Functionality ✨
--------------------
* Add implementation of `resample` for Series and DataFrame (#1625)
* Add `merge` implementation for `DataFrame` and as free function (#1695)
* melt implementation (#1689)
* Enable running Modin via remote Ray on spawned cluster (#1818) 🎉

Code Quality + Testing 💯
-------------------------
* Move logic of `sort_values` into the query compiler (#1754)
* Add commitlint check on pull requests (#1760)
* REFACTOR-#1763: Move logic of `merge` (#1764)
* Limit object store to 1GB during CI tests (#1744)

Backend enhancements + Performance 🚀
-------------------------------------
* Improve performance of slice indexing (#1753)
* Update iterator implementation to `iloc` (#1599)
* Speed up RPyC connection (#1833)

Documentation 📃
----------------
* Fix missing links in the architecture page (#1810)
* add runner of taxi benchmark as example (#1836)
* Add notes about using MODIN_SOCKS_PROXY variable (#1817)
* add runner of h2o benchmark as example (#1856)

Contributors this release
-------------------------
The following users contributed code to Modin since the last release.
@hwsamuel (First Time contributor) ⭐️
@ikedaosushi (First Time contributor) ⭐️
@itamarst (First Time contributor) ⭐️
@pratheekrebala (First Time contributor) ⭐️
@prutskov (Returning contributor) 🌟
@amyskov (Returning contributor) 🌟
@vnlitvinov (Returning contributor) 🌟
@dchigarev (Returning contributor) 🌟
@YarShev (Returning contributor) 🌟
@anmyachev (Returning contributor) 🌟
@gshimansky (Returning contributor) 🌟
@devin-petersohn (Maintainer)
🎉🎉 Thank you! 🎉🎉
Modin version 0.7.4
Minor release for version/bugfix updates
Modin 0.7.3 release notes
Bugfixes + Pandas Concordance (🐛 + 🐼)
----------------------------------------
* DataFrame.drop_duplicates correctly identifies subset type now (718ae58d)
* Fix error in Groupby when grouping with multiple columns (776ae108)
* Update supported pandas version to 1.0.3 (588444fb)
* Fix numpy array assignment on empty dataframe (f763cea1)
* Add temporary fix for ray race condition (78395d96)
* Choosing dependencies for pip install modin[all] based on platform (c69f455c)
* Make DataFrame and Series constructor behavior match pandas (fa3bd55b)
* Assigning Series to an empty DataFrame is done correctly (362d20f2)
* Fix cases where read_json should fall back to pandas (670d56f3)
* Fix parquet file reading when file was written with pandas Index (8a4a3859)
* Fix internal APIs for read_gbq (bf437e75)
* Change ‘getitem_slice’ to ‘iloc’ implementation (8efca60f)

New Functionality ✨
--------------------
* Add MODIN_CPUS env variable for command line CPU specification (745caf7c)
* Add implementation for Series.groupby (4b49f1ae) ⭐️
* Add broadcast_apply internally to apply functions (439b396f)
* Add parallel support for binary functions with Series other (79d639f1)

Code Quality + Testing 💯
-------------------------
* Add cov flags to pytest.ini so coverage is always computed (#1132) (51f195f5)
* Update GitHub Actions to remove those tests run on TeamCity (9e6851ea)
* Remove dead code from PythonFrameManager (5dc904b6)
* Make query compiler base class a Python ABC. (e9b839a7)
* Deprecate Dask Delayed in favor of Dask Futures (7edfc489)
* Add Dask to coverage tests (1ea094b4)

Backend enhancements + Performance 🚀
-------------------------------------
* Improve import performance and reduce sys.modules (68d377aa)

Documentation 📃
----------------
* Add Windows Conda environment file for development on Windows machine. (8c56e1a6)
* Fix description of from_pandas method (1005b365)
* Fix description of some methods (3650a137)

Dependencies 🔗
---------------
* Bump psutil from 5.4.8 to 5.6.6 (7f067677)
* Pin pyarrow to 0.16 until we can update interfaces internally (8fc5532b)
* Update Ray version to 0.8.4 (6d90c83b)

Contributors this release
-------------------------
The following users contributed code to Modin since the last release.
@YarShev (First time contributor) ⭐️
@anmyachev (First time contributor) ⭐️
@datapythonista (First time contributor) ⭐️
@gshimansky (Returning contributor) 🌟
@simon-mo (Maintainer)
@devin-petersohn (Maintainer)
🎉🎉 Thank you! 🎉🎉
Modin 0.7.2 release notes
This release updates the pandas version and fixes some minor bugs.
Bugfixes + Pandas Concordance (🐛 + 🐼)
- Fix `fill_value` parameter for `DataFrame.sub`, other binary ops (#1109)
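For context on the `fill_value` fix above, the following is a minimal sketch of the pandas behavior that binary operations such as `DataFrame.sub` are expected to match. The frames are invented for illustration, and the fallback to plain pandas is an assumption added so the snippet runs without Modin installed.

```python
try:
    import modin.pandas as pd  # Modin, if it is installed
except ImportError:
    import pandas as pd  # assumption: fall back to pandas so the sketch still runs

# Each frame has a missing value in a different row.
left = pd.DataFrame({"a": [1.0, None, 3.0]})
right = pd.DataFrame({"a": [10.0, 20.0, None]})

# With fill_value=0, a value missing on one side is treated as 0
# before subtracting, instead of producing NaN in the result.
diff = left.sub(right, fill_value=0)

result = diff["a"].tolist()
print(result)
```

Without `fill_value`, rows 1 and 2 would be NaN because one operand is missing in each.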
Code Quality + Testing 💯
- Make query compiler interface consistent for binary ops (#1106)
- Add a bot message to display TeamCity test results (#1124)
Dependencies 🔗
- Add randomly generated redis password for Ray by default (#1107)
- Update pandas version to 1.0 and fix compatibility (#1081) ⭐️
Contributors this release
The following users contributed code to Modin since the last release.
@gshimansky (Maintainer)
@devin-petersohn (Maintainer)
🎉🎉 Thank you! 🎉🎉
Modin 0.7.1 release notes
In this release we focused on improving testing infrastructure and fixing outstanding bugs. One particular longstanding bug with the Dask runtime, related to serialization, was fixed (#1096). This revealed an opportunity for more optimization when it comes to serialization within the repository.
Bugfixes + Pandas Concordance (🐛 + 🐼)
- Fix loc when column mask is boolean and no rows are masked (#1039)
- Fix opening a file from s3 when S is capitalized (#1045)
- Fix pd.to_datetime when using a Series object. (#1048)
- Fix issue with Dask engine where users were not able to use Da (#1050)
- Fix `as_index=False` for `DataFrame.groupby` (#1041)
- Fix ingesting parquet files that are coming from HDFS (#1074)
- fixed bug where .to_frame(name) ignored name; added parity wit (#1075)
- Fix column indexing when 2 or more columns have same name (#1077)
- Fix `Series.__getitem__` to accept a callable as key (#1084)
- bugfix - sometimes df.loc[s] applies to columns instead of rows (#1088)
- df.rename works with 'mapper' and 'axis' params (#1057)
- Fix case where apply a Series across the columns threw error (#1092)
- Fix Dask Serialization issue (#1096)
New Functionality ✨
- Add r operators to Series (#1086)
Code Quality + Testing 💯
- Update Test infrastructure to use testmon when possible (#1036)
- Re-enable testmon force selection (#1040)
- Update master to track coverage again (#1052)
- Update test for groupby and clean up groupby edge cases (#1055)
- Change Error to warning for pandas version pin (#1072)
- Upload coverage correctly from merged PRs (#1076)
Documentation 📃
- Update Installation Documentation (#1066)
- Documentation updates (#1097)
- Add documentation for signed-off-by policy (#1098)
Dependencies 🔗
Contributors this release
The following users contributed code to Modin since the last release.
@elonp (First time contributor) ⭐️
@KevOBrien (First time contributor) ⭐️
@devin-petersohn (Maintainer)
🎉🎉 Thank you! 🎉🎉
Modin 0.7.0
Modin 0.7.0 release notes
Modin 0.7 comes with the largest expansion of the API since the first release. Modin now supports over 83% of the pandas API, up from 71% last release. A number of long-awaited features have been implemented, including I/O support for Dask for parquet and other column stores, and `groupby` with a list of column names.
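To make the `groupby` feature concrete, here is a minimal sketch of grouping by a list of column names followed by a reduction. The data is invented for illustration, and the fallback to plain pandas is an assumption added so the snippet runs without Modin installed.

```python
try:
    import modin.pandas as pd  # Modin, if it is installed
except ImportError:
    import pandas as pd  # assumption: fall back to pandas so the sketch still runs

# Hypothetical data, invented for this example.
df = pd.DataFrame(
    {
        "city": ["A", "A", "B", "B"],
        "year": [2019, 2020, 2019, 2019],
        "sales": [1, 2, 3, 4],
    }
)

# Group by two columns at once, then reduce each group with a sum.
totals = df.groupby(["city", "year"]).sum()

sales = totals["sales"].tolist()
print(sales)
```

The result has one row per distinct (`city`, `year`) pair, sorted by the group keys.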
Bugfixes + Pandas Concordance (🐛 + 🐼)
- Allow merging of named Series (#879)
- Correctly merge `CategoricalDtype` dtypes (#889)
- Fix issue where certain arguments were not defaulting to pandas (#890)
- Send full path to workers on `read_csv` (#899)
- Remove `__array_prepare__` from Series API (#900)
- Add Series.str.title to API (#901)
- Fix `df.squeeze` when `axis=0` on a 1x1 dataframe (#902)
- Fix `skiprows` logic for `read_csv` (#918)
- read_sql() will default to pandas when chunksize is given (#920)
- Fix DeprecationWarning: invalid escape sequence \d (#950)
- Fix `apply` when `args` is set (#953)
- Fix inplace updates on partitions (#962)
- Fix bug where certain encodings were throwing an error (#980)
- Fix inplace operations without inplace keyword on empty dataframes (#983)
- Support console with repr like pandas (#984)
- Fix `count` when `numeric_only=False` (#1002)
- Fix bug in `loc` where slice on columns only threw Exception (#1024)
New Functionality ✨
- support for duplicated() and drop_duplicates() (#892)
- Create SeriesGroupBy wrapper to default to pandas and return to Modin (#908)
- Bring I/O support to Dask for everything supported (#955) ⭐️
- Add support for grouping by multiple columns when doing a reduction (#987) ⭐️
- Implement `DataFrame.at_time` and `Series.at_time` (#991)
- Add implementation for `between_time` for `Series` and `DataFram… (#992)
- Implement `combine` for Series and DataFrame (#995)
- Add implementation for `combine_first` for `DataFrame` and `Seri… (#996)
- Add implementation for `droplevel` for `Series` and `DataFrame` (#1000)
- Implement `assign` for `DataFrame` (#998)
- Add implementation for `first` for `Series` and `DataFrame` (#1006)
- Add implementation for `last` for `DataFrame` and `Series` (#1007)
- Add implementation for `swapaxes` for DataFrame and Series (#1010)
- Add implementation for `tz_convert` for `Series` and `DataFrame` (#1013)
- Implement `tz_localize` for `Series` and `DataFrame` (#1014)
- Add implementation for `tshift` (#1016)
- Add implementation: `swaplevel` for `Series` and `DataFrame` (#1018)
- Add implementation: `reorder_levels` for DataFrame and Series (#1022)
- Add implementation: `take` for Series and DataFrame (#1020)
- Add implementation: `truncate` for Series and DataFrame (#1026)
- Fix bug where Parsing error was thrown when text spanned multip… (#1027)
Code Quality + Testing 💯
- Update pytest and clean up tests a bit (#903)
- CI updates (#924, #925, #926, #927, #928, #929, #930, #931, #933, #936, #937, #938)
- Fix Windows Remove file test error (#943)
- Add test script for simple execution of all unit tests (#948)
- Fix pyarrow.parquet import in tests (#952)
- Support parameters with run-tests.sh (#959)
- Fix environment variables for CI and Master test suite (#964)
- Make test_dataframe.py more granular with test suite (#966)
- Cache pip dependencies between builds to speed up process (#967)
- Fix master CI workflow order and simplify workflow names (#968)
- Update master build to allow coverage to be run (#969)
- Optimize CI with minimal pip installs (#971)
- use versioneer for versioning using VCS (#1028)
Backend enhancements + Performance 🚀
- Improve performance of setting a column from an existing one (#942)
- Increase `n_workers` for Dask from default to number of cores (#965)
Documentation 📃
- Update README to have a more accurate API coverage section (#974)
- Move API Coverage section in README to a more appropriate place (#975)
- Update README to add advanced usage (#988)
Dependencies 🔗
- Enforce only Python3+ on future releases (#907)
- Change Coverage version to avoid sqlite3 errors (#916)
- Remove top level import of `py` (#935)
- Update Ray version to latest (#941)
- Restructure import attempt to only try Ray if on a non-windows machine (#945)
- Set `pure=False` for Dask `Client.submit` and `hash=False` for `… (#957)
Contributors this release
The following users contributed code to Modin since the last release.
@ecoughlan (First time contributor) ⭐️
@aeroaks (First time contributor) ⭐️
@eavidan (Returning contributor) 🌟
@devin-petersohn (Maintainer)
🎉🎉 Thank you! 🎉🎉
Modin 0.6.3
Modin 0.6.3 release notes
The 0.6.3 release comes with several bugfixes and code quality improvements. Notably, the CI was moved to GitHub Actions for faster test completion time and better developer experience. There were also new additions to Modin functionality: a `level` argument for `any` and `all`, and a new implementation for Dask readers with `read_csv` and `read_json`.
Bugfixes + Pandas Concordance (🐛 + 🐼)
- Fix level parameter for aggregations when names are `None` (#838)
- Fix `map` and `apply` over a `Series` that contains list values (#840)
- Fix metadata computation for `get_dummies` (#845)
- fix for Python 3.7: re._pattern_type no longer exist (#849)
- Fix integer overflow when doing a mask for Windows users (#858)
- Fix issue where partition metadata would be off after insert + window (#869)
- Fix issue for updating to Categorical Dtypes from `astype` (#870)
- Fix psycopg2 connection functionality (#736)
New Functionality ✨
- Add support for level argument in any, all (#833)
- Add Implementation for Dask `read_csv` and `read_json`. (#861)
Code Quality + Testing 💯
- Remove legacy _get_nan_block_id (#844)
- ci: upgrade linters (flake8, black) (#848)
- add GitHub Actions workflow for CI (#850)
- Deprecate Travis CI in favor of GitHub Actions (#856)
- Add CODECOV_SECRET to the environment variable to upload Codecov report (#857)
- Remove debugging prints and add a test for prints (#768)
Backend enhancements + Performance 🚀
- Refactor IO by breaking up into modular readers (#754)
Documentation 📃
- Update actions name and create badge on README (#854)
Dependencies 🔗
Contributors this release
The following users contributed code to Modin since the last release.
@rliu4439 (First time contributor) ⭐️
@smola (First time contributor) ⭐️
@RehanSD (Returning contributor) 🌟
@devin-petersohn (Maintainer)
🎉🎉 Thank you! 🎉🎉