Update Data Selection stub on the main README #50

Merged (1 commit, Sep 16, 2021)
7 changes: 6 additions & 1 deletion README.md
@@ -153,7 +153,12 @@ Researchers have started tackling these challenges by developing new techniques

[Data Selection Area Page](data-selection.md)

_This description is a stub, you can help by improving it._
Massive amounts of data have enabled many of the successes of deep learning, but big data brings its own problems: working with massive datasets is cumbersome and expensive in terms of both computational resources and labeling. Data selection methods, such as active learning and core-set selection, can mitigate the pains of big data by selecting the most valuable examples to label or train on.
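
As a concrete illustration of the active learning loop described above, a minimal uncertainty-sampling round might look like the following sketch (not from this PR; it assumes a scikit-learn-style classifier exposing `predict_proba`, and `unlabeled_pool` is a hypothetical feature matrix):

```python
import numpy as np

def uncertainty_sample(model, unlabeled_pool, budget):
    """Return indices of the `budget` pool examples the model is least
    confident about; these are the ones sent to an oracle for labeling."""
    probs = model.predict_proba(unlabeled_pool)  # shape: (n_examples, n_classes)
    confidence = probs.max(axis=1)               # probability of the predicted class
    return np.argsort(confidence)[:budget]       # least confident first
```

In a full loop, the newly labeled examples join the training set and the model is retrained before the next selection round.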

While data selection has long been an active area in AI/ML, the scale and skew of modern industrial datasets have pushed the field to value data more accurately and to improve the scalability of selection methods. Recent works, such as [Sener & Savarese](https://openreview.net/pdf?id=H1aIuk-RW) and [Ghorbani et al.](https://proceedings.mlr.press/v97/ghorbani19c/ghorbani19c.pdf), take a more data-centric approach to quantifying the contribution of each training example, focusing on diversity and representativeness rather than relying solely on model uncertainty. To help these methods scale, approaches like [SVP](https://openreview.net/pdf?id=HJg2b0VYDr) and [SEALS](https://arxiv.org/pdf/2007.00077.pdf) present straightforward ways to reduce computational costs by up to three orders of magnitude, enabling web-scale active learning and data selection more broadly.
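
To make the diversity-driven side of this concrete, the greedy k-center heuristic behind core-set selection (as in Sener & Savarese) fits in a few lines. This is a simplified sketch under assumed inputs: `embeddings` is an (n, d) feature matrix and `seed_idx` is a non-empty list of already-selected indices; the names are illustrative:

```python
import numpy as np

def k_center_greedy(embeddings, seed_idx, budget):
    """Greedily pick `budget` points, each time taking the point farthest
    from everything selected so far (a 2-approximation to k-center)."""
    # Distance from every point to its nearest seed point.
    # (Broadcasting is O(n * len(seed_idx)); chunk this at web scale.)
    dists = np.linalg.norm(
        embeddings[:, None, :] - embeddings[seed_idx][None, :, :], axis=-1
    ).min(axis=1)
    picked = []
    for _ in range(budget):
        new = int(np.argmax(dists))  # farthest point from the current selection
        picked.append(new)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[new], axis=1))
    return picked
```

The scalability tricks in the cited works slot into this picture: SVP computes signals like these embeddings with a small proxy model instead of the full target model, and SEALS restricts each round's candidate pool to the nearest neighbors of the labeled examples.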

These advances in label and computational efficiency make data selection applicable to modern datasets, allowing AI/ML to take a more data-centric view focused on quality rather than quantity.


<h1 id="privacy">Data Privacy (Under Construction)</h1>
