Feature/223 unify pandas dtype inference #296

Merged 10 commits on Oct 24, 2023
8 changes: 4 additions & 4 deletions CONTRIBUTING.md
@@ -11,10 +11,10 @@ Technical details on how to contribute can be found in our [documentation](https

 There are several ways you can contribute to Spotlight:

-* Fix outstanding issues.
-* Implement new features.
-* Submit issues related to bugs or desired new features.
-* Share your use case
+- Fix outstanding issues.
+- Implement new features.
+- Submit issues related to bugs or desired new features.
+- Share your use case

 If you don't know where to start, you might want to have a look at [hacktoberfest issues](https://github.com/Renumics/spotlight/issues?q=is%3Aissue+is%3Aopen+label%3Ahacktoberfest)
 and our guide on how to create a [new Lens](https://renumics.com/docs/development/lenses).
15 changes: 7 additions & 8 deletions README.md
@@ -17,9 +17,10 @@

 <p align="center"><a href="https://spotlight.renumics.com"><img src="static/img/spotlight_video.gif" width="100%"/></a></p>

-Spotlight helps you to **understand unstructured datasets** fast. You can quickly create **interactive visualizations** and leverage data enrichments (e.g. embeddings, prediction, uncertainties) to **identify critical clusters** in your data.
+Spotlight helps you to **understand unstructured datasets** fast. You can quickly create **interactive visualizations** and leverage data enrichments (e.g. embeddings, prediction, uncertainties) to **identify critical clusters** in your data.

 Spotlight supports most unstructured data types including **images, audio, text, videos, time-series and geometric data**. You can start from your existing dataframe:
+
 <p align="left"><img src="static/img/dataframe_head_sample.png" width="100%"/></a></p>

 And start Spotlight with just a few lines of code:
@@ -49,7 +50,7 @@ Machine learning and engineering teams use Spotlight to understand and communica
 <td rowspan="3">[Classification]</td>
 <td>Find Issues in Any Image Classification Dataset</td>
 <td><a href="https://www.renumics.com/next/docs/use-cases/image-classification">👨‍💻</a> <a href="https://medium.com/@daniel-klitzke/finding-problematic-data-slices-in-unstructured-data-aeec0a3b9a2a">📝</a> <a href="https://huggingface.co/spaces/renumics/sliceguard-unstructured-data">🕹️</a></td>
-</tr>
+</tr>
 <tr>
 <td>Find data issues in the CIFAR-100 image dataset</td>
 <td><a href="https://huggingface.co/spaces/renumics/navigate-data-issues">🕹️</a></td>
@@ -91,7 +92,6 @@ Machine learning and engineering teams use Spotlight to understand and communica
 </tbody>
 </table>

-
 ## ⏱️ Quickstart

 Get started by installing Spotlight and loading your first dataset.
@@ -132,12 +132,11 @@ ds = datasets.load_dataset('renumics/emodb-enriched', split='all')
layout= spotlight.layouts.debug_classification(label='gender', prediction='m1_gender_prediction', embedding='m1_embedding', features=['age', 'emotion'])
spotlight.show(ds, layout=layout)
```

Here, the data types are discovered automatically from the dataset and we use a pre-defined layout for model debugging. Custom layouts can be built programmatically or via the UI.

> The `datasets[audio]` package can be installed via pip.



#### Usage Tracking

We have added crash report and performance collection. We do NOT collect user data other than an anonymized Machine Id obtained by py-machineid, and only log our own actions. We do NOT collect folder names, dataset names, or row data of any kind, only aggregate performance statistics like total time of a table_load, crash data, etc. Collecting Spotlight crashes will help us improve stability. To opt out of the crash report collection, define an environment variable called `SPOTLIGHT_OPT_OUT` and set it to true, e.g. `export SPOTLIGHT_OPT_OUT=true`
@@ -150,9 +149,9 @@ We have added crash report and performance collection. We do NOT collect user da

 ## Learn more about unstructured data workflows

-- 🤗 [Huggingface](https://huggingface.co/renumics) example spaces and datasets
-- 🏀 [Playbook](https://renumics.com/docs/playbook/) for data-centric AI workflows
-- 🍰 [Sliceguard](https://github.com/Renumics/sliceguard) library for automatic slice detection
+- 🤗 [Huggingface](https://huggingface.co/renumics) example spaces and datasets
+- 🏀 [Playbook](https://renumics.com/docs/playbook/) for data-centric AI workflows
+- 🍰 [Sliceguard](https://github.com/Renumics/sliceguard) library for automatic slice detection

 ## Contribute

16 changes: 2 additions & 14 deletions renumics/spotlight/data_source/data_source.py
@@ -6,7 +6,6 @@

 import pandas as pd
 import numpy as np
-from pydantic.dataclasses import dataclass

 from renumics.spotlight.dataset.exceptions import (
     ColumnExistsError,
@@ -30,17 +29,6 @@ class ColumnMetadata:
     tags: List[str] = dataclasses.field(default_factory=list)


-@dataclass
-class CellsUpdate:
-    """
-    A dataset's cell update.
-    """
-
-    value: Any
-    author: str
-    edited_at: str
-
-
 class DataSource(ABC):
     """abstract base class for different data sources"""

@@ -61,7 +49,7 @@ def column_names(self) -> List[str]:
     @abstractmethod
     def intermediate_dtypes(self) -> DTypeMap:
         """
-        The dtypes of intermediate values
+        The dtypes of intermediate values. Values for all columns must be filled.
         """

     @property
@@ -94,7 +82,7 @@ def check_generation_id(self, generation_id: int) -> None:
     @abstractmethod
     def semantic_dtypes(self) -> DTypeMap:
         """
-        Semantic dtypes for viewer.
+        Semantic dtypes for viewer. Some values may be not present.
         """

     @abstractmethod
62 changes: 40 additions & 22 deletions renumics/spotlight/data_store.py
@@ -21,13 +21,16 @@
     DType,
     DTypeMap,
     EmbeddingDType,
+    array_dtype,
     is_array_dtype,
     is_audio_dtype,
     is_category_dtype,
+    is_embedding_dtype,
     is_file_dtype,
     is_str_dtype,
     is_mixed_dtype,
     is_bytes_dtype,
+    is_window_dtype,
     str_dtype,
     audio_dtype,
     image_dtype,
@@ -173,33 +176,32 @@ def _guess_dtype(self, col: str) -> DType:
             return semantic_dtype

         sample_values = self._data_source.get_column_values(col, slice(10))
-        sample_dtypes = [_guess_value_dtype(value) for value in sample_values]
-
-        try:
-            mode_dtype = statistics.mode(sample_dtypes)
-        except statistics.StatisticsError:
+        sample_dtypes: List[DType] = []
+        for value in sample_values:
+            guessed_dtype = _guess_value_dtype(value)
+            if guessed_dtype is not None:
+                sample_dtypes.append(guessed_dtype)
+        if not sample_dtypes:
             return semantic_dtype

-        return mode_dtype or semantic_dtype
+        mode_dtype = statistics.mode(sample_dtypes)
+        # For windows and embeddings, at least sample values must be aligned.
+        if is_window_dtype(mode_dtype) and any(
+            not is_window_dtype(dtype) for dtype in sample_dtypes
+        ):
+            return array_dtype
+        if is_embedding_dtype(mode_dtype) and any(
+            (not is_embedding_dtype(dtype)) or dtype.length != mode_dtype.length
+            for dtype in sample_dtypes
+        ):
+            return array_dtype
+
+        return mode_dtype


 def _intermediate_to_semantic_dtype(intermediate_dtype: DType) -> DType:
     if is_array_dtype(intermediate_dtype):
-        if intermediate_dtype.shape is None:
-            return intermediate_dtype
-        if intermediate_dtype.shape == (2,):
-            return window_dtype
-        if intermediate_dtype.ndim == 1 and intermediate_dtype.shape[0] is not None:
-            return EmbeddingDType(intermediate_dtype.shape[0])
-        if intermediate_dtype.ndim == 1 and intermediate_dtype.shape[0] is None:
-            return sequence_1d_dtype
-        if intermediate_dtype.ndim == 2 and (
-            intermediate_dtype.shape[0] == 2 or intermediate_dtype.shape[1] == 2
-        ):
-            return sequence_1d_dtype
-        if intermediate_dtype.ndim == 3 and intermediate_dtype.shape[-1] in (1, 3, 4):
-            return image_dtype
-        return intermediate_dtype
+        return _guess_array_dtype(intermediate_dtype)
     if is_file_dtype(intermediate_dtype):
         return str_dtype
     if is_mixed_dtype(intermediate_dtype):
@@ -262,5 +264,21 @@ def _guess_value_dtype(value: Any) -> Optional[DType]:
     except (TypeError, ValueError):
         pass
     else:
-        return ArrayDType(value.shape)
+        return _guess_array_dtype(ArrayDType(value.shape))
     return None
+
+
+def _guess_array_dtype(dtype: ArrayDType) -> DType:
+    if dtype.shape is None:
+        return dtype
+    if dtype.shape == (2,):
+        return window_dtype
+    if dtype.ndim == 1 and dtype.shape[0] is not None:
+        return EmbeddingDType(dtype.shape[0])
+    if dtype.ndim == 1 and dtype.shape[0] is None:
+        return sequence_1d_dtype
+    if dtype.ndim == 2 and (dtype.shape[0] == 2 or dtype.shape[1] == 2):
+        return sequence_1d_dtype
+    if dtype.ndim == 3 and dtype.shape[-1] in (1, 3, 4):
+        return image_dtype
+    return dtype
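
Reviewer note: the shape heuristic in `_guess_array_dtype` and the sample-alignment fallback in `_guess_dtype` can be tried out without Spotlight installed. The block below is a minimal sketch under stated assumptions, not the library's implementation: plain string labels stand in for the real `DType` objects, shapes are passed as tuples, and the names `guess_array_kind` and `guess_column_kind` are hypothetical.

```python
import statistics
from typing import Optional, Sequence, Tuple

# A shape is a tuple of dimension sizes; None marks a variable-length axis.
Shape = Tuple[Optional[int], ...]


def guess_array_kind(shape: Optional[Shape]) -> str:
    """Map an array shape to a label, in the same branch order as the diff."""
    if shape is None:
        return "array"
    if shape == (2,):
        return "window"
    if len(shape) == 1 and shape[0] is not None:
        # Fixed-length 1D arrays are treated as embeddings of that length.
        return f"embedding[{shape[0]}]"
    if len(shape) == 1:
        return "sequence_1d"
    if len(shape) == 2 and (shape[0] == 2 or shape[1] == 2):
        return "sequence_1d"
    if len(shape) == 3 and shape[-1] in (1, 3, 4):
        return "image"
    return "array"


def guess_column_kind(
    sample_shapes: Sequence[Optional[Shape]], fallback: str = "array"
) -> str:
    """Pick the most common kind among samples, as in the new `_guess_dtype`."""
    kinds = [guess_array_kind(shape) for shape in sample_shapes]
    if not kinds:
        # statistics.mode raises StatisticsError on empty input, hence the guard.
        return fallback
    mode_kind = statistics.mode(kinds)
    # Windows and embeddings only make sense if every sample agrees;
    # the embedding label includes the length, so lengths are compared too.
    if mode_kind == "window" or mode_kind.startswith("embedding"):
        if any(kind != mode_kind for kind in kinds):
            return "array"
    return mode_kind


print(guess_column_kind([(128,), (128,), (128,)]))
print(guess_column_kind([(128,), (128,), (64,)]))
print(guess_column_kind([(32, 32, 3), (32, 32, 3), (5, 5)]))
```

Since Python 3.8, `statistics.mode` returns the first of several equally common values rather than raising, so the only failure case left is an empty sample list. That matches the diff replacing the old `try`/`except` with an explicit emptiness check.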
10 changes: 2 additions & 8 deletions renumics/spotlight/dataset/__init__.py
@@ -32,12 +32,7 @@
 from typing_extensions import TypeGuard

 from renumics.spotlight.__version__ import __version__
-from renumics.spotlight.io.pandas import (
-    infer_dtypes,
-    prepare_column,
-    is_string_mask,
-    stringify_columns,
-)
+from .pandas import create_typed_series, infer_dtypes, is_string_mask, prepare_column
 from renumics.spotlight.typing import (
     BoolType,
     IndexType,
@@ -47,7 +42,6 @@
     is_integer,
     is_iterable,
 )
-from renumics.spotlight.io.pandas import create_typed_series
 from renumics.spotlight.dtypes.conversion import prepare_path_or_url
 from renumics.spotlight import dtypes as spotlight_dtypes

@@ -738,7 +732,7 @@ def from_pandas(
             df = df.reset_index(level=df.index.names)  # type: ignore
         else:
             df = df.copy()
-        df.columns = pd.Index(stringify_columns(df))
+        df.columns = pd.Index([str(column) for column in df.columns])

         if dtypes is None:
             dtypes = {}