[NLP] Consider adding a distinction in the research filter between automatically classified posts and prediction-based classification #76

Open
ronentk opened this issue May 8, 2024 · 3 comments

@ronentk
Contributor

ronentk commented May 8, 2024

For example, something like this:

from enum import Enum

class SciFilterClassfication(Enum):
    NOT_CLASSIFIED = "not_classified"
    # Posts automatically classified as research
    # (for example, based on citoid item types)
    RESEARCH_AUTO = "research_auto"
    # Posts predicted to be related to research
    RESEARCH_PRED = "research_pred"
    # Posts predicted to be unrelated to research
    NOT_RESEARCH = "not_research"

This would replace the current form:

class SciFilterClassfication(Enum):
    NOT_CLASSIFIED = "not_classified"
    RESEARCH = "research"
    NOT_RESEARCH = "not_research"

The rationale is:
1. It can help with the filter evaluation by differentiating between easy (auto) and hard (pred) cases.
2. We might want to use the information in the app to further organize the queue/UX.

What do you think, @ShaRefOh?

@ronentk ronentk added the enhancement New feature or request label May 8, 2024
@ronentk ronentk self-assigned this May 8, 2024
@ShaRefOh
Contributor

ShaRefOh commented May 8, 2024

We can present the data in a meaningful way, but we should not evaluate it as a multi-label problem, since the true labels are binary by definition. What are the conditions for getting "research_auto"? I already have the item types logged in the outcome dataset, so I can simply use them to run an evaluation that includes an aggregation over that data.
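
A minimal sketch of what such an evaluation could look like, assuming the four-way enum proposed above: labels are collapsed to binary before being compared with the true labels, and accuracy is then aggregated by whether the post passed automatically or through the prediction. The helper names `to_binary` and `accuracy_by_route`, and the choice to skip unclassified posts, are hypothetical and not part of the existing code.

```python
from collections import Counter
from enum import Enum


class SciFilterClassfication(Enum):
    NOT_CLASSIFIED = "not_classified"
    RESEARCH_AUTO = "research_auto"
    RESEARCH_PRED = "research_pred"
    NOT_RESEARCH = "not_research"


def to_binary(label):
    """Collapse the proposed labels to the binary research / not-research view."""
    return label in (
        SciFilterClassfication.RESEARCH_AUTO,
        SciFilterClassfication.RESEARCH_PRED,
    )


def accuracy_by_route(predictions, true_labels):
    """Binary accuracy, grouped by 'auto' (whitelist) vs 'pred' (model) decisions.

    `true_labels` holds booleans (True = research), matching the binary annotations.
    """
    correct, total = Counter(), Counter()
    for pred, is_research in zip(predictions, true_labels):
        # Assumption: posts that were never classified are left out of the evaluation.
        if pred is SciFilterClassfication.NOT_CLASSIFIED:
            continue
        route = "auto" if pred is SciFilterClassfication.RESEARCH_AUTO else "pred"
        total[route] += 1
        correct[route] += int(to_binary(pred) == is_research)
    return {route: correct[route] / total[route] for route in total}
```

The evaluation itself stays binary; the auto/pred split is used only to group the results.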

@ronentk
Contributor Author

ronentk commented May 8, 2024

item_types_whitelist = [
    "bookSection",
    "journalArticle",
    "preprint",
    "book",
    "manuscript",
    "thesis",
    "presentation",
    "conferencePaper",
    "report",
]


    # If any item type is on the whitelist, pass automatically
    if len(set(result.item_types).intersection(set(item_types_whitelist))) > 0:
        return SciFilterClassfication.RESEARCH

(https://github.com/Common-SenseMakers/sensemakers/blob/nlp-dev/nlp/desci_sense/shared_functions/filters/research_filter.py)
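
As a rough sketch of how the whitelist check could map onto the proposed labels: a whitelist hit would return RESEARCH_AUTO, and everything else would fall through to the prediction. The function `classify_post` and its `predicted_research` argument are hypothetical stand-ins, not the actual interface of research_filter.py.

```python
from enum import Enum


class SciFilterClassfication(Enum):
    NOT_CLASSIFIED = "not_classified"
    RESEARCH_AUTO = "research_auto"
    RESEARCH_PRED = "research_pred"
    NOT_RESEARCH = "not_research"


ITEM_TYPES_WHITELIST = frozenset({
    "bookSection", "journalArticle", "preprint", "book", "manuscript",
    "thesis", "presentation", "conferencePaper", "report",
})


def classify_post(item_types, predicted_research):
    """Hypothetical filter: whitelist first, then the model prediction.

    `predicted_research` is True/False, or None if no prediction is available.
    """
    # Any whitelisted item type passes automatically (the "easy" case).
    if ITEM_TYPES_WHITELIST.intersection(item_types):
        return SciFilterClassfication.RESEARCH_AUTO
    if predicted_research is None:
        return SciFilterClassfication.NOT_CLASSIFIED
    # Otherwise rely on the prediction (the "hard" case).
    if predicted_research:
        return SciFilterClassfication.RESEARCH_PRED
    return SciFilterClassfication.NOT_RESEARCH
```

With this shape, the whitelist takes precedence, so a post with a whitelisted item type is labeled RESEARCH_AUTO regardless of the prediction.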

@ronentk
Contributor Author

ronentk commented May 8, 2024

@ShaRefOh this condition holds for your annotations as well, right?

if len(set(result.item_types).intersection(set(item_types_whitelist))) > 0:
    return SciFilterClassfication.RESEARCH
