Schema definition #4

Open
gpavlov2016 opened this issue Oct 23, 2024 · 4 comments

Comments

@gpavlov2016

To jumpstart the conversation, what other mandatory fields do we need besides the following:

  • Paper Title
  • Author(s)
  • Code repository

@gpavlov2016
Author

Max Vasiliev commented on Slack:

I was wondering what we mean by mandatory. A paper title and its authors always go together, but a code repository may or may not be associated with a given paper title/author, or with any paper for that matter.
Suppose a famous paper spawns hundreds of GitHub projects: how do we show that? Or suppose the paper authors' repo is broken and dead, while another has a thriving community and ends up spinning off into various companies and other projects. We want to show that and acknowledge the original paper and authors, while also accurately describing the connection to, and the actors within, the open-source world. This matters especially because the paper may go on to be referenced by other papers in the academic world, diverging from developments on the software side.

@gpavlov2016
Author

The project is about mapping research papers to code repositories; if I got that wrong, please correct me. Hence, a code repository seems to be a requirement when submitting a new entry to the database.

@jring-o
Contributor

jring-o commented Oct 26, 2024

What this issue seems to be addressing is the "Papers" node, and what the necessary attributes and relationships are for that node. This is fantastic. I'm going to provide a bit more context on the larger project, then address minimum attributes and relationships for papers.

I am using the term "minimum" here because there are no "mandatory" attributes or relationships. Some papers might have all attributes (an author, title, code, etc.), while others might only have one or two. Both can exist in the graph.

Context

MOSS is about mapping an ecosystem, not only papers. Papers are one aspect of the ecosystem.

Four core questions related to this thread are:

  • What papers use what software to produce discoveries and/or patents?
  • What other projects do those papers cite?
  • Who builds those projects?
  • Who supports those people?

With this in mind, we might use these questions to guide us:

  • What are the minimum nodes needed to tell this story?
  • What are the minimum fields/attributes to map to those nodes?
  • What are the minimum relationships to map between those nodes?

The minimums we have identified so far are laid out here:

https://docs.google.com/document/d/1NEWtI7hqQA74jk9Geg8bwKVS3qTzV9hWAfEMQg_Y1gM/edit?tab=t.0

High level, the core nodes are:

  • People
  • Projects
  • Papers
  • Organizations

The core attributes and relationships can be viewed in the doc.

So the question becomes:

Is this model a good starting point? What are we missing? For example, in the "projects" node, I don't think we yet have a "depends on" relationship for mapping dependencies.

Paper Attributes

Here are the current attributes and relationships for papers:

Attributes

  • doi
  • title
  • description
  • url
  • published Date
  • has Public Data
  • has Public Code

Relationships

  • Cites another paper
  • Cites a project
  • Mentions a project
  • Is related to a domain

We capture authors through a relationship stemming from the "people" node.
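
To make the list above concrete, here is a minimal Python sketch of a paper record. The class name `Paper`, the snake_case field names, the `known_attributes` helper, and the sample values are all assumptions for illustration, not part of the linked doc. Every field is optional, reflecting the point that there are minimum attributes but no mandatory ones.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Paper:
    """Hypothetical in-memory stand-in for a Paper node; all attributes optional."""
    doi: Optional[str] = None
    title: Optional[str] = None
    description: Optional[str] = None
    url: Optional[str] = None
    published_date: Optional[str] = None  # assumed to be an ISO 8601 string
    has_public_data: Optional[bool] = None
    has_public_code: Optional[bool] = None
    # Relationships, stored here as identifiers of the target nodes.
    cites_papers: List[str] = field(default_factory=list)
    cites_projects: List[str] = field(default_factory=list)
    mentions_projects: List[str] = field(default_factory=list)
    domains: List[str] = field(default_factory=list)

    def known_attributes(self) -> dict:
        """Return only the scalar attributes that have been filled in."""
        scalar = ("doi", "title", "description", "url",
                  "published_date", "has_public_data", "has_public_code")
        return {k: getattr(self, k) for k in scalar if getattr(self, k) is not None}


# Placeholder values, invented for illustration only.
sparse = Paper(title="Untitled preprint")
full = Paper(doi="10.0000/example", title="Example", has_public_code=True,
             cites_projects=["project:example"])
```

Both `sparse` and `full` are valid records under this sketch, so a paper with only a title can sit in the graph next to one with a DOI and linked code.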

So, the question becomes:

Are these good starting attributes and relationships for papers? What is missing?

@pheochromo

pheochromo commented Oct 29, 2024

One concern I see is that mapping outward from paper space reaches only a subset of all projects unless we add some extra work: interpretation, a confidence score, etc.

It's great if (:Paper)-[:OFFICIAL_CODE]->(:Project) exists (the paper links to the authors' repo).
But if the goal is to eventually quantify the aggregate effective (:Paper)-{[:CONTRIBUTED_CODE]}->(:Project) to rank paper impact, you'd either need to resolve (:Project)-[:DEPENDS_ON]->(:Project), i.e. build a dependency scraper, or start classifying and hand-waving.
E.g., do all projects calling the Hugging Face transformers library automatically connect to, and count towards, the score of "Attention Is All You Need"? How tractable is this?
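
As a rough sketch of what resolving DEPENDS_ON would involve, assume a scraper has already produced a plain dict mapping each project to its direct dependencies. The function name `downstream_projects` and the toy project names are hypothetical. The code inverts the edges and walks breadth-first from a paper's official implementation to every project that depends on it, directly or transitively.

```python
from collections import defaultdict, deque


def downstream_projects(official_project, depends_on):
    """Return every project that directly or transitively depends on official_project."""
    # Invert the DEPENDS_ON edges so we can walk from a dependency to its dependents.
    dependents = defaultdict(set)
    for project, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(project)

    seen = set()
    queue = deque([official_project])
    while queue:
        current = queue.popleft()
        for nxt in dependents.get(current, ()):
            # The seen set keeps the walk finite even if the graph has cycles.
            if nxt not in seen and nxt != official_project:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Toy data, invented for illustration only.
depends_on = {
    "app-a": ["lib-x"],
    "app-b": ["app-a"],
    "lib-x": ["paper-impl"],
    "unrelated": ["other-lib"],
}
reach = downstream_projects("paper-impl", depends_on)
```

Here `reach` contains `lib-x`, `app-a`, and `app-b` but not `unrelated`. The traversal itself is cheap; the open question is whether every project in `reach` should count equally towards the paper's score.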

jring-o, it's not just missing from the schema; it's a key piece of the work we'd need to do.

In the other direction, what exactly would (:Paper)-[:CITES]->(:Project) show?
If we have (:Paper)-[:OFFICIAL_CODE]->(:Project), we can scrape the official implementation's dependencies upstream.
But then what about (:Paper)-[:MENTIONED]->(:Project)? No one even mentions LaTeX :')

Example: OpenAlex has only the preprint of https://arxiv.org/abs/2405.21060, with 1 citation, but this work is already in active use and being built upon further.

I think this shows integration, but how much are those models actually being used? Can we estimate that from code class names? From forum discussions?
https://github.com/search?q=MambaForCausalLM&type=code

I also imagine the number of definitive connections between papers and projects is small relative to the total number of papers and projects. That is, most papers won't have official code. Are we focusing on those that do? Maybe building off something like inclusion in HF transformers/TensorFlow/Keras?

How do we feel about cycles? 😅
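
On cycles, one option is to detect them up front rather than forbid them. Below is a minimal sketch using depth-first search with three node states, assuming the same kind of project-to-dependencies dict a scraper might produce. The name `find_cycle` and the toy graphs are hypothetical.

```python
def find_cycle(depends_on):
    """Return one dependency cycle as a list of project names, or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color = {}
    stack = []

    def visit(node):
        color[node] = GRAY
        stack.append(node)
        for dep in depends_on.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is on the current path, so the path from dep back to dep is a cycle.
                return stack[stack.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(depends_on):
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Toy graphs, invented for illustration only.
cyclic = {"a": ["b"], "b": ["c"], "c": ["a"]}
acyclic = {"a": ["b"], "b": ["c"], "c": []}
```

The sketch is recursive, so a very deep dependency chain could hit Python's recursion limit; an explicit stack would avoid that at the cost of slightly more code.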
