[FEAT] Minimal indices dtype for FixedShapeSparseTensors #3149

Merged
samster25 merged 42 commits into Eventual-Inc:main from sagiahrac:minimal-uint-type-for-indices
Nov 13, 2024

Conversation

sagiahrac
Contributor

@sagiahrac sagiahrac commented Oct 30, 2024

The indices in FixedShapeSparseTensors are limited by the total number of elements within each tensor. As long as they remain within the range defined by the tensor’s shape, we can choose a more compact data type for the indices, reducing memory usage without sacrificing functionality.
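To make the idea concrete, here is a minimal sketch of how the narrowest index type could be chosen from a fixed shape. It assumes the indices are flattened positions in the range 0 to (number of elements - 1), as the description suggests. The helper name `minimal_uint_dtype` and the string dtype names are hypothetical and are not Daft's API; the actual change lives in the Rust cast code.

```python
from math import prod

# Unsigned integer widths, smallest first, with the largest value each can hold.
_UINT_LIMITS = [
    ("uint8", 2**8 - 1),
    ("uint16", 2**16 - 1),
    ("uint32", 2**32 - 1),
    ("uint64", 2**64 - 1),
]


def minimal_uint_dtype(shape):
    """Return the name of the narrowest unsigned dtype that can hold every
    flattened index of a tensor with the given fixed shape.

    Hypothetical helper for illustration only; not part of Daft.
    """
    num_elements = prod(shape)
    # Flattened indices run from 0 to num_elements - 1.
    max_index = max(num_elements - 1, 0)
    for name, limit in _UINT_LIMITS:
        if max_index <= limit:
            return name
    raise ValueError(f"shape {shape} has too many elements for uint64 indices")


print(minimal_uint_dtype((16, 16)))    # 256 elements, max index 255
print(minimal_uint_dtype((256, 257)))  # 65,792 elements, max index 65,791
```

Under these assumptions a 16x16 tensor needs only one byte per index instead of eight, which is where the memory saving comes from.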

@github-actions github-actions bot added the enhancement New feature or request label Oct 30, 2024

codspeed-hq bot commented Oct 30, 2024

CodSpeed Performance Report

Merging #3149 will degrade performance by 14.33%

Comparing sagiahrac:minimal-uint-type-for-indices (4f6dacd) with main (a61e8a4)

Summary

⚡ 1 improvements
❌ 1 regressions
✅ 15 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| Benchmark | `main` | `sagiahrac:minimal-uint-type-for-indices` | Change |
|---|---|---|---|
| `test_count[1 Small File]` | 3.8 ms | 3.3 ms | +15.11% |
| `test_iter_rows_first_row[100 Small Files]` | 264.1 ms | 308.3 ms | -14.33% |

@sagiahrac sagiahrac marked this pull request as draft October 30, 2024 15:53
Member

@samster25 samster25 left a comment

This makes sense! My main question: this would only apply to FixedShapeSparseTensors, right?

src/daft-core/src/array/ops/cast.rs (outdated, resolved)
@sagiahrac sagiahrac marked this pull request as ready for review November 4, 2024 07:57
sagiahrac and others added 15 commits November 4, 2024 10:06
## The Rationales

Thanks to the [great
work](Eventual-Inc#3018) from
@universalmind303 , Daft now supports `INTERVAL` type exposed from
`arrow2`.

Beyond DataFrame support, this PR aims to unlock simple `INTERVAL`
usage in SQL syntax, mainly copied from
[planner.rs](https://github.com/sgl-project/sglang/pull/1790/files#diff-ea02b059cdabc0939616c35c6566dbcf980a5794306dedd241c2823afd9b2db2).

Note: This naive implementation doesn't fully support complex interval
scenarios, such as leap years or relative duration addition and
subtraction. These may need more careful handling in follow-ups.
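As a rough illustration of why leap years complicate interval arithmetic, the sketch below uses only Python's standard library. The 365-day conversion is an assumption chosen for the example; it is not a statement about how Daft or `arrow2` represent intervals.

```python
from datetime import date, timedelta

start = date(2024, 1, 1)              # 2024 is a leap year (366 days)
naive_one_year = timedelta(days=365)  # assumption: "1 year" flattened to 365 days

naive_result = start + naive_one_year
calendar_result = start.replace(year=start.year + 1)

print(naive_result)     # 2024-12-31
print(calendar_result)  # 2025-01-01
```

A fixed-length duration lands one day short of the calendar anniversary here, which is the kind of case a relative (calendar-aware) interval has to handle separately.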

---------

Signed-off-by: Austin Liu <[email protected]>
* Removes Int128 Type
* Refactor Decimal128 to be backed by a DataArray rather than a
LogicalArray
* Implements math operations for Decimal
* Implements comparison operations for Decimal
Likely also increases performance due to removing heap alloc in some
places.

---------

Co-authored-by: Colin Ho <[email protected]>
…#2776)

Bumps
[slackapi/slack-github-action](https://github.com/slackapi/slack-github-action)
from 1.26.0 to 1.27.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/slackapi/slack-github-action/releases">slackapi/slack-github-action's
releases</a>.</em></p>
<blockquote>
<h2>Slack Send V1.27.0</h2>
<h2>What's changed</h2>
<p>This release introduces an optional <code>payload-delimiter</code>
parameter for flattening nested objects with a customized delimiter
before the payload is sent to Slack Workflow Builder when using workflow
webhook triggers.</p>
<pre lang="diff"><code>  - name: Send a custom flattened payload
    uses: slackapi/[email protected]
+   with:
+     payload-delimiter: &quot;_&quot;
    env:
      SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
</code></pre>
<p>Setting this value to an underscore (<code>_</code>) is recommended
when using nested inputs within Workflow Builder to match expected input
formats of Workflow Builder, but the actual value can be changed to
something else! This &quot;flattening&quot; behavior
<strong>did</strong> exist prior to this version, but used a period
(<code>.</code>) which is not valid for webhook inputs in Workflow
Builder.</p>
<!-- raw HTML omitted -->
<p>The resulting output of flattened objects is not always clear, but
the following can hopefully serve as a quick reference as well as <a
href="https://github.com/slackapi/slack-github-action/blob/5d1fb07d3c4f410b8d278134c714edff31264beb/test/slack-send-test.js#L264-L319">these
specs</a> when using <code>_</code> as the delimiter:</p>
<p><strong>Input</strong>:</p>
<pre lang="json"><code>{
    &quot;apples&quot;: &quot;tree&quot;,
    &quot;bananas&quot;: {
        &quot;truthiness&quot;: true
    }
}
</code></pre>
<p><strong>Output</strong>:</p>
<pre lang="json"><code>{
    &quot;apples&quot;: &quot;tree&quot;,
    &quot;bananas_truthiness&quot;: &quot;true&quot;
}
</code></pre>
<p>Notice that <code>bananas_truthiness</code> is also stringified in
this process, as part of updating values to match the expected inputs of
Workflow Builder!</p>
<!-- raw HTML omitted -->
<h2>Changes</h2>
<p>In addition to the changes above, the following lists all of the
changes since the prior version with the <strong>complete
changelog</strong> changes found here: <a
href="https://github.com/slackapi/slack-github-action/compare/v1.26.0...v1.27.0">https://github.com/slackapi/slack-github-action/compare/v1.26.0...v1.27.0</a></p>
<h4>🎁 Enhancements</h4>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/slackapi/slack-github-action/commit/37ebaef184d7626c5f204ab8d3baff4262dd30f0"><code>37ebaef</code></a>
Automatic compilation</li>
<li><a
href="https://github.com/slackapi/slack-github-action/commit/5d1fb07d3c4f410b8d278134c714edff31264beb"><code>5d1fb07</code></a>
chore(release): tag version 1.27.0</li>
<li><a
href="https://github.com/slackapi/slack-github-action/commit/3bc06716971bb1dc2899ccd0332da69b8b778356"><code>3bc0671</code></a>
chore(deps): bump axios to 1.7.5 (<a
href="https://redirect.github.com/slackapi/slack-github-action/issues/332">#332</a>)</li>
<li><a
href="https://github.com/slackapi/slack-github-action/commit/b452451af72f751bd902edfbbc084a8b2e6e5031"><code>b452451</code></a>
feat: make the payload delimiter configurable for workflow webhook
triggers (...</li>
<li><a
href="https://github.com/slackapi/slack-github-action/commit/c50e848fe18b1da5665e19286e3c9b86ad1b3bf5"><code>c50e848</code></a>
build(deps-dev): bump mocha from 10.5.2 to 10.7.0 (<a
href="https://redirect.github.com/slackapi/slack-github-action/issues/328">#328</a>)</li>
<li><a
href="https://github.com/slackapi/slack-github-action/commit/e4a9c4b6853f8b64ba9fee848d3f30198f9427c1"><code>e4a9c4b</code></a>
build(deps): bump <code>@​slack/web-api</code> from 7.2.0 to 7.3.2 (<a
href="https://redirect.github.com/slackapi/slack-github-action/issues/327">#327</a>)</li>
<li><a
href="https://github.com/slackapi/slack-github-action/commit/9a7f0fa18816ae797b801ec2c27a04499fc2381b"><code>9a7f0fa</code></a>
build(deps-dev): bump chai from 4.4.1 to 4.5.0 (<a
href="https://redirect.github.com/slackapi/slack-github-action/issues/326">#326</a>)</li>
<li><a
href="https://github.com/slackapi/slack-github-action/commit/73b7062b8dccf12c0d62626d19953ea628e418ba"><code>73b7062</code></a>
build(deps-dev): bump eslint-plugin-jsdoc from 48.5.0 to 48.10.2 (<a
href="https://redirect.github.com/slackapi/slack-github-action/issues/325">#325</a>)</li>
<li><a
href="https://github.com/slackapi/slack-github-action/commit/3d5207b5cf109bd2640ec20613ed7f29ab46e853"><code>3d5207b</code></a>
build(deps): bump https-proxy-agent from 7.0.4 to 7.0.5 (<a
href="https://redirect.github.com/slackapi/slack-github-action/issues/320">#320</a>)</li>
<li><a
href="https://github.com/slackapi/slack-github-action/commit/4e15b6a964ca554d1a7b7a56850baa97e8316be2"><code>4e15b6a</code></a>
build(deps): bump <code>@​slack/web-api</code> from 7.0.4 to 7.2.0 (<a
href="https://redirect.github.com/slackapi/slack-github-action/issues/323">#323</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/slackapi/slack-github-action/compare/v1.26.0...v1.27.0">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=slackapi/slack-github-action&package-manager=github_actions&previous-version=1.26.0&new-version=1.27.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

You can trigger a rebase of this PR by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)


</details>

> **Note**
> Automatic rebases have been disabled on this pull request as it has
been open for over 30 days.

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [image](https://github.com/image-rs/image) from 0.24.9 to 0.25.4.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/image-rs/image/blob/main/CHANGES.md">image's
changelog</a>.</em></p>
<blockquote>
<h3>Version 0.25.4</h3>
<p>Features:</p>
<ul>
<li>Much faster decoding of lossless WebP due to a variety of
optimizations. Our benchmarks show 2x to 2.5x improvement.</li>
<li>Added support for orientation metadata, so that e.g. smartphone
camera images could be displayed correctly:
<ul>
<li>Added <code>ImageDecoder::orientation()</code> and implemented
orientation metadata extraction for JPEG, WebP and TIFF formats</li>
<li>Added <code>DynamicImage::apply_orientation()</code> to apply the
orientation to an image</li>
</ul>
</li>
<li>Added support for extracting Exif metadata from images via
<code>ImageDecoder::exif_metadata()</code>, and implemented it for JPEG
and WebP formats</li>
<li>Added <code>ImageEncoder::set_icc_profile()</code> and implemented
it for WebP format. Pull requests with implementations for other formats
are welcome.</li>
<li>Added <code>DynamicImage::fast_blur()</code> for a linear-time
approximation of Gaussian blur, which is much faster at larger blur
radii</li>
</ul>
<p>Bug fixes:</p>
<ul>
<li>Fixed some APNG images being decoded incorrectly</li>
<li>Fixed the iterator over animated WebP frames to return
<code>None</code> instead of an error when the end of the animation is
reached</li>
</ul>
<h3>Version 0.25.3</h3>
<p>Yanked! This version accidentally missed a commit that should have
been
included with the release. The <code>Orientation</code> struct should be
in the
appropriate module instead of the top-level. This release won't be
supported.</p>
<h3>Version 0.25.2</h3>
<p>Features:</p>
<ul>
<li>Added the HDR encoder to supported formats in generic write methods
with the
<code>hdr</code> feature enabled. Supports 32-bit float RGB color only,
for now.</li>
<li>When cloning <code>ImageBuffer</code>, <code>DynamicImage</code> and
<code>Frame</code> the existing buffer
will now be reused if possible.</li>
<li>Added <code>image::ImageReader</code> as an alias.</li>
<li>Implement <code>ImageEncoder</code> for
<code>HdrEncoder</code>.</li>
</ul>
<p>Structural changes</p>
<ul>
<li>Switch from <code>byteorder</code> to <code>byteorder-lite</code>,
consolidating some casting
unsafety to <code>bytemuck</code>.</li>
<li>Many methods on <code>DynamicImage</code> and buffers gained
<code>#[must_use]</code> indications.</li>
</ul>
<p>Bug fixes:</p>
<ul>
<li>Removed test data included in the crate archive.</li>
<li>The WebP animation decoder stops when reaching the indicated frame
count.</li>
<li>Fixed bugs in the <code>bmp</code> decoder.</li>
<li>Format support gated on the <code>exr</code> feature now compiles in
isolation.</li>
</ul>
<h3>Version 0.25.1</h3>
<p>Bug fixes:</p>
<ul>
<li>Fixed corrupt JPEG output when attempting to encode images
containing an alpha
channel.</li>
<li>Only accept &quot;.ff&quot; file extension for farbfeld images.</li>
<li>Correct farbfeld feature flag for
<code>ImageFormat::{reading_enabled, writing_enabled}</code>.</li>
<li>Disable strict mode for JPEG decoder.</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/image-rs/image/commit/0307a47de2ea14eea8a497a859724e7ee005773c"><code>0307a47</code></a>
Merge pull request <a
href="https://redirect.github.com/image-rs/image/issues/2354">#2354</a>
from image-rs/release-0.25.4</li>
<li><a
href="https://github.com/image-rs/image/commit/ac09ced4b3cba911934baae797512e4105a02d3b"><code>ac09ced</code></a>
Propose wording for republishing as 0.25.4</li>
<li><a
href="https://github.com/image-rs/image/commit/5e6bf4fd3c77b0eeaae0a64216e9321b56f16cf1"><code>5e6bf4f</code></a>
Merge pull request <a
href="https://redirect.github.com/image-rs/image/issues/2352">#2352</a>
from image-rs/changelog-update</li>
<li><a
href="https://github.com/image-rs/image/commit/42d1396eb4ef250605bd83c999e45c4106bd5b90"><code>42d1396</code></a>
Drop incorrect changelog entry</li>
<li><a
href="https://github.com/image-rs/image/commit/d52a194e5c3fa304143cc71d85d551e88fd211d9"><code>d52a194</code></a>
Merge pull request <a
href="https://redirect.github.com/image-rs/image/issues/2347">#2347</a>
from Shnatsel/new-release</li>
<li><a
href="https://github.com/image-rs/image/commit/fe94eabb7f7491b9ba9378ea5ece2f8884c30c65"><code>fe94eab</code></a>
Mention lossless WebP improvements</li>
<li><a
href="https://github.com/image-rs/image/commit/5976c195939bfbede976fe1e0a80225d192a793c"><code>5976c19</code></a>
Merge pull request <a
href="https://redirect.github.com/image-rs/image/issues/2349">#2349</a>
from Shnatsel/orientation-in-metadata</li>
<li><a
href="https://github.com/image-rs/image/commit/91a001f23146d3fdb47c8eca9a4b19ebea3e4fc6"><code>91a001f</code></a>
Don't import orientation in doc example</li>
<li><a
href="https://github.com/image-rs/image/commit/693079d51491bf0ab4c41403520f2dceba6dd3a0"><code>693079d</code></a>
Reword ravif changelog entry</li>
<li><a
href="https://github.com/image-rs/image/commit/fb5799bd8fdfac399c9b40817b62a98dada19a1b"><code>fb5799b</code></a>
Move Orientation to metadata module</li>
<li>Additional commits viewable in <a
href="https://github.com/image-rs/image/compare/v0.24.9...v0.25.4">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=image&package-manager=cargo&previous-version=0.24.9&new-version=0.25.4)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)


Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [adlfs](https://github.com/fsspec/adlfs) from 2023.10.0 to
2024.7.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/fsspec/adlfs/releases">adlfs's
releases</a>.</em></p>
<blockquote>
<h2>2024.7.0</h2>
<h2>What's Changed</h2>
<ul>
<li>Fix account host by <a
href="https://github.com/dorbaker"><code>@​dorbaker</code></a> in <a
href="https://redirect.github.com/fsspec/adlfs/pull/480">fsspec/adlfs#480</a></li>
<li>Allow blobs and file systems to pickle by <a
href="https://github.com/ghidalgo3"><code>@​ghidalgo3</code></a> in <a
href="https://redirect.github.com/fsspec/adlfs/pull/479">fsspec/adlfs#479</a></li>
<li>support signed urls via connection string alone by <a
href="https://github.com/shcheklein"><code>@​shcheklein</code></a> in <a
href="https://redirect.github.com/fsspec/adlfs/pull/478">fsspec/adlfs#478</a></li>
</ul>
<h2>New Contributors</h2>
<ul>
<li><a href="https://github.com/dorbaker"><code>@​dorbaker</code></a>
made their first contribution in <a
href="https://redirect.github.com/fsspec/adlfs/pull/480">fsspec/adlfs#480</a></li>
<li><a href="https://github.com/ghidalgo3"><code>@​ghidalgo3</code></a>
made their first contribution in <a
href="https://redirect.github.com/fsspec/adlfs/pull/479">fsspec/adlfs#479</a></li>
<li><a
href="https://github.com/shcheklein"><code>@​shcheklein</code></a> made
their first contribution in <a
href="https://redirect.github.com/fsspec/adlfs/pull/478">fsspec/adlfs#478</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/fsspec/adlfs/compare/2024.4.1...2024.7.0">https://github.com/fsspec/adlfs/compare/2024.4.1...2024.7.0</a></p>
<h2>2024.4.1</h2>
<h2>What's Changed</h2>
<ul>
<li>Honor the anon parameter if set by <a
href="https://github.com/adam-roughton"><code>@​adam-roughton</code></a>
in <a
href="https://redirect.github.com/fsspec/adlfs/pull/468">fsspec/adlfs#468</a></li>
</ul>
<h2>New Contributors</h2>
<ul>
<li><a
href="https://github.com/adam-roughton"><code>@​adam-roughton</code></a>
made their first contribution in <a
href="https://redirect.github.com/fsspec/adlfs/pull/468">fsspec/adlfs#468</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/fsspec/adlfs/compare/2024.4.0...2024.4.1">https://github.com/fsspec/adlfs/compare/2024.4.0...2024.4.1</a></p>
<h2>2024.4.0</h2>
<h2>What's Changed</h2>
<ul>
<li>add missing await on delete_blob call per issue 459 by <a
href="https://github.com/johnmacnamararseg"><code>@​johnmacnamararseg</code></a>
in <a
href="https://redirect.github.com/fsspec/adlfs/pull/460">fsspec/adlfs#460</a></li>
<li>format via black and add installation of dev deps to contributing
docs by <a
href="https://github.com/johnmacnamararseg"><code>@​johnmacnamararseg</code></a>
in <a
href="https://redirect.github.com/fsspec/adlfs/pull/464">fsspec/adlfs#464</a></li>
<li>Make AzureBlobFileSystem anon behaviour configurable via env var. by
<a href="https://github.com/microft"><code>@​microft</code></a> in <a
href="https://redirect.github.com/fsspec/adlfs/pull/437">fsspec/adlfs#437</a></li>
<li>document that <code>credential</code> needs to be from
azure.identity.aio by <a
href="https://github.com/temporaer"><code>@​temporaer</code></a> in <a
href="https://redirect.github.com/fsspec/adlfs/pull/463">fsspec/adlfs#463</a></li>
</ul>
<h2>New Contributors</h2>
<ul>
<li><a
href="https://github.com/johnmacnamararseg"><code>@​johnmacnamararseg</code></a>
made their first contribution in <a
href="https://redirect.github.com/fsspec/adlfs/pull/460">fsspec/adlfs#460</a></li>
<li><a href="https://github.com/microft"><code>@​microft</code></a> made
their first contribution in <a
href="https://redirect.github.com/fsspec/adlfs/pull/437">fsspec/adlfs#437</a></li>
<li><a href="https://github.com/temporaer"><code>@​temporaer</code></a>
made their first contribution in <a
href="https://redirect.github.com/fsspec/adlfs/pull/463">fsspec/adlfs#463</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/fsspec/adlfs/compare/2024.2.0...2024.4.0">https://github.com/fsspec/adlfs/compare/2024.2.0...2024.4.0</a></p>
<h2>2024.2.0</h2>
<h2>What's Changed</h2>
<ul>
<li>fs.url(): expose response content headers for pre-signed URLs by <a
href="https://github.com/pmrowla"><code>@​pmrowla</code></a> in <a
href="https://redirect.github.com/fsspec/adlfs/pull/451">fsspec/adlfs#451</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/fsspec/adlfs/compare/2024.1.0...2024.2.0">https://github.com/fsspec/adlfs/compare/2024.1.0...2024.2.0</a></p>
<h2>2024.1.0</h2>
<h2>What's Changed</h2>
<ul>
<li>adlfs: fix version typo by <a
href="https://github.com/efiop"><code>@​efiop</code></a> in <a
href="https://redirect.github.com/fsspec/adlfs/pull/449">fsspec/adlfs#449</a></li>
<li>Check for Hdi_isfolder with a capital by <a
href="https://github.com/basnijholt"><code>@​basnijholt</code></a> in <a
href="https://redirect.github.com/fsspec/adlfs/pull/418">fsspec/adlfs#418</a></li>
<li>put_file: default to overwrite=True by <a
href="https://github.com/pmrowla"><code>@​pmrowla</code></a> in <a
href="https://redirect.github.com/fsspec/adlfs/pull/419">fsspec/adlfs#419</a></li>
<li>Fix recursive delete on hierarchical namespace accounts by <a
href="https://github.com/Tom-Newton"><code>@​Tom-Newton</code></a> in <a
href="https://redirect.github.com/fsspec/adlfs/pull/454">fsspec/adlfs#454</a></li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/fsspec/adlfs/blob/main/CHANGELOG.md">adlfs's
changelog</a>.</em></p>
<blockquote>
<p><strong>Change Log</strong></p>
<h2>Unreleased</h2>
<ul>
<li><code>AzureBlobFileSystem</code> and <code>AzureBlobFile</code>
support pickling.</li>
<li>Handle mixed casing for <code>hdi_isfolder</code> metadata when
determining whether a blob should be treated as a folder.</li>
<li><code>_put_file</code>: <code>overwrite</code> now defaults to
<code>True</code>.</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li>See full diff in <a
href="https://github.com/fsspec/adlfs/compare/2023.10.0...2024.7.0">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=adlfs&package-manager=pip&previous-version=2023.10.0&new-version=2024.7.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

You can trigger a rebase of this PR by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)


> **Note**
> Automatic rebases have been disabled on this pull request as it has
been open for over 30 days.

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
I'm using RustRover to write and debug Rust code, and I noticed that
debug runs in RustRover don't work out of the box: no variables are
shown when a breakpoint is hit.

It turns out that RustRover launches the debug process with the test
profile, which inherits from the dev[^1] profile. I am not sure why the
dev[^2] profile in this project is configured without debug enabled.
This PR adds a test profile with debug enabled.

[^1]: https://doc.rust-lang.org/cargo/reference/profiles.html#test
[^2]: https://github.com/Eventual-Inc/Daft/blob/main/Cargo.toml#L86
conradsoon and others added 7 commits November 4, 2024 10:45
…. /tmp/**.csv) (Eventual-Inc#3100)

Closes Eventual-Inc#1820.

The main issue seems to be that the `globset` crate is permissive about
which patterns it builds (no error is thrown when we try to build a
pattern for `/tmp/**.csv`, for instance), so we have to check for any
such patterns ourselves.
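A check of this kind could look roughly like the following sketch. It assumes the rule being enforced is that `**` must occupy a whole path component; the function name is hypothetical and the real check in Daft is written in Rust and may differ.

```python
def has_malformed_double_star(pattern: str) -> bool:
    """Return True if `**` shares a path component with other characters.

    Assumed rule for illustration: `**` is only valid as a complete
    component, e.g. `/tmp/**/*.csv`, and not as `/tmp/**.csv`.
    """
    for component in pattern.split("/"):
        if "**" in component and component != "**":
            return True
    return False


print(has_malformed_double_star("/tmp/**.csv"))    # True
print(has_malformed_double_star("/tmp/**/*.csv"))  # False
```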
Streaming writes for swordfish (parquet + csv only). Iceberg and delta
writes are here: Eventual-Inc#2966

Implement streaming writes as a blocking sink. Unpartitioned writes run
with 1 worker, and partitioned writes run with NUM_CPUs workers. As a
drive-by, blocking sinks are now parallelizable.

**Behaviour**
- Unpartitioned: Make writes to a `TargetFileSizeWriter`, which manages
file sizes and row group sizes, as data is streamed in.

- Partitioned: Partition data via a `Dispatcher` and send to workers
based on the hash. Each worker runs a `PartitionedWriter` that manages
partitioning by value, file sizes, and row group sizes.
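The partitioned path can be pictured with a small sketch: rows are routed to a worker by hashing the partition key, so all rows sharing a key value end up on the same worker. The function below is a hypothetical stand-in, not the `Dispatcher` from this PR, and `zlib.crc32` is used only because it is deterministic across runs.

```python
import zlib
from collections import defaultdict


def dispatch_by_hash(rows, key, num_workers):
    """Group rows into per-worker buckets by hashing the partition key."""
    buckets = defaultdict(list)
    for row in rows:
        # Hash the key's string form so the routing is stable between runs.
        worker = zlib.crc32(str(row[key]).encode("utf-8")) % num_workers
        buckets[worker].append(row)
    return dict(buckets)


rows = [
    {"region": "eu", "value": 1},
    {"region": "us", "value": 2},
    {"region": "eu", "value": 3},
]
buckets = dispatch_by_hash(rows, key="region", num_workers=4)
```

Because a given key value always hashes to the same worker, each worker can manage its own per-value files without coordinating with the others.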


**Benchmarks:**
I made a new benchmark suite in
`tests/benchmarks/test_streaming_writes.py`. It tests writes of tpch
lineitem to parquet/csv, with and without partition columns and with
different file/row group sizes. The streaming executor performs much
better when there are partition columns, as seen in this screenshot.
Without partition columns it is about the same, and when the target row
group size / file size is decreased it is slightly slower, likely
because it does more slicing, but this needs more investigation.
Memory usage is the same for both.
<img width="1400" alt="Screenshot 2024-10-03 at 11 22 32 AM"
src="https://github.com/user-attachments/assets/53b4d77d-553a-4181-8a4d-9eddaa3adaf7">

Memory test on read->write parquet tpch lineitem sf1:
Native:
<img width="1078" alt="Screenshot 2024-10-08 at 1 48 34 PM"
src="https://github.com/user-attachments/assets/3eda33c6-9413-415f-b808-ac3c7437e269">

Python:
<img width="1090" alt="Screenshot 2024-10-08 at 1 48 50 PM"
src="https://github.com/user-attachments/assets/f92b9a9f-a3b5-408b-98d5-4ba2d66b7be4">

---------

Co-authored-by: Colin Ho <[email protected]>
Co-authored-by: Colin Ho <[email protected]>
Co-authored-by: Colin Ho <[email protected]>
Spawns compute tasks on joinsets so that they can be cancelled.

---------

Co-authored-by: Colin Ho <[email protected]>
This PR marks `PartitionTasks` as done only after they have been
explicitly marked as done by the runner.

Previously, we used the existence of the `.results` on a PartitionTask
to determine whether or not it is done. However, this is not quite
correct in the case of the RayRunner, which will attach a result
containing a Ray ObjectRef, which is a future. This future may not (and
is likely not) be completed yet at the time of PartitionTask creation.
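The distinction can be sketched with a toy class. `SketchTask` is a hypothetical stand-in, not Daft's `PartitionTask`, and a `concurrent.futures.Future` plays the role of a Ray ObjectRef that has not completed yet.

```python
from concurrent.futures import Future


class SketchTask:
    """Toy stand-in for a partition task; not Daft's PartitionTask."""

    def __init__(self):
        self.result = None   # may hold a not-yet-completed future
        self._done = False   # flipped only by the runner

    def set_result(self, result):
        self.result = result

    def mark_done(self):
        self._done = True

    def done(self):
        return self._done


task = SketchTask()
pending = Future()  # stands in for a Ray ObjectRef still in flight
task.set_result(pending)

print(task.result is not None)  # True: a result object is attached
print(task.done())              # False: the runner has not marked it done

pending.set_result("partition data")
task.mark_done()
print(task.done())              # True
```

Checking `result is not None` would have reported the task as finished while the future was still pending, which is the mismatch described above.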

---------

Co-authored-by: Jay Chia <[email protected]@users.noreply.github.com>
@jaychia @colin-ho Just added temporal doc section to expressions.rst.
Let me know what you think of the content and then we can finalize which
page or user-guide section to put it on from there. Thanks!

---------

Co-authored-by: Colin Ho <[email protected]>
A whole bunch of boilerplate for tpc-ds benchmarking and testing.

I wanted to keep this separate from the others as there's not much
functionality here; it just adds a `dsdgen` command to the Makefile to
generate tpc-ds datasets. I called it `dsdgen` because that's what
duckdb calls it, and this uses the duckdb implementation to generate all
of the datasets.

The answers were copied from
[duckdb/duckdb/extension/tpcds/dsdgen/answers](https://github.com/duckdb/duckdb/tree/10c42435f1805ee4415faa5d6da4943e8c98fa55/extension/tpcds/dsdgen/answers)

Usage:

```sh
# defaults to sf=1 and dir=data/tpc-ds
> make dsdgen
> make dsdgen SCALE_FACTOR=<scale_factor> OUTPUT_DIR=<output_dir>
```

## Notes for reviewer

Most files here are boilerplate. 

The only relevant files are:
- Makefile
- requirements_dev.txt
- benchmarking/tpc-ds/datagen.py
When running in a Ray Job, without the user invoking any Ray commands or
`ray.init()` explicitly, the `ray.is_initialized()` function returns
False.

This means that Daft "does not know" that it is running inside of a Ray
cluster, and thus will not default to using the RayRunner. This can lead
to unexpected behavior when using `daft-launcher` because a user must
know to call `daft.context.set_runner_ray()`.

This PR changes that behavior by attempting to look up the `$RAY_JOB_ID`
environment variable, as a heuristic to tell whether or not it is
currently running inside of a Ray job.
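In outline, the heuristic amounts to an environment lookup like the sketch below. The function name and the optional `environ` parameter are hypothetical, added here so the check can be exercised without a Ray cluster; the PR's actual code may be structured differently.

```python
import os


def running_in_ray_job(environ=None):
    """Guess whether we are inside a Ray job by looking for RAY_JOB_ID."""
    env = os.environ if environ is None else environ
    # An unset or empty variable is treated as "not in a Ray job".
    return bool(env.get("RAY_JOB_ID"))


print(running_in_ray_job({"RAY_JOB_ID": "02000000"}))  # True
print(running_in_ray_job({}))                          # False
```

Since this is only a heuristic, a caller would still want an explicit override such as `daft.context.set_runner_ray()` for cases where the variable is missing.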

To test, I just ran a Ray job and called `daft.context.get_context()`
after initializing a Daft dataframe

<img width="1350" alt="image"
src="https://github.com/user-attachments/assets/0a6d8ae4-034a-424d-a3d7-9311d08be454">

---------

Co-authored-by: EC2 Default User <[email protected]>
Co-authored-by: Jay Chia <[email protected]@users.noreply.github.com>
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 4, 2024
@sagiahrac
Contributor Author

sagiahrac commented Nov 4, 2024

@samster25 I also added a FixedShapeSparseTensor->Python cast implementation, preserving the indices data type. Previously, we were casting FixedShapeSparseTensor->SparseTensor->Python, which implicitly converted all indices to uint64.

@sagiahrac
Contributor Author

> This makes sense! My main question: this would only apply to FixedShapeSparseTensors, right?

True, to avoid dtype ambiguity for dynamic sparse tensors (i.e. when concatenating 2 dataframes). What do you think?

src/daft-core/src/array/ops/cast.rs (outdated, resolved)
src/daft-schema/src/dtype.rs (outdated, resolved)
src/daft-core/src/array/ops/cast.rs (outdated, resolved)
@samster25
Member

Hi @sagiahrac! Just took a look, just a few minor requests!

src/daft-core/src/array/ops/sparse_tensor.rs (outdated, resolved)
src/daft-core/src/array/ops/cast.rs (outdated, resolved)
@samster25 samster25 enabled auto-merge (squash) November 13, 2024 06:41
@samster25 samster25 merged commit 0d2bb2a into Eventual-Inc:main Nov 13, 2024
39 of 40 checks passed
Labels
documentation Improvements or additions to documentation enhancement New feature or request
10 participants