Notes to add to README when there is time #19

Open
alexpghayes opened this issue Jan 26, 2022 · 0 comments

README beyond this point is really just scratch for myself

Sink nodes and unreachable nodes

library(aPPR)
library(igraph)

# sample_pa() gives a directed graph with sink nodes and unreachable nodes
citation_graph <- sample_pa(100)

citation_tracker <- appr(citation_graph, seeds = "5")
citation_tracker

Why should I use aPPR?

  • you are curious about nodes that are important to the community around a particular user, but that you wouldn't find without algorithmic help

  • the 1-hop network is too small, while 2-3 hop networks are too large (recall that the diameter of the Twitter graph is 3.7!)

  • you want to study a particular community but don't know exactly which accounts to investigate, though you do have a good idea of one or two important accounts in that community

aPPR calculates an approximation to personalized PageRank, not the exact values

comment on nodes with p = 0 versus p != 0 (see the sketch below)
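A rough sketch of that distinction, assuming the Tracker returned by appr() exposes node-level results in a stats tibble with a p column:

library(dplyr)

# nodes the walk touched but whose approximated PageRank is exactly zero
filter(citation_tracker$stats, p == 0)

# nodes with non-zero approximated PageRank
filter(citation_tracker$stats, p != 0)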

Advice on choosing epsilon

Number of unique visits as a function of epsilon, API wait times, runtime roughly proportional to 1 / (alpha * epsilon), etc.

Speaking strictly in terms of the p != 0 nodes:

  • 1e-4 and 1e-5: finishes quickly; neighbors with high degree get visited.
  • 1e-6: visits most of the 1-hop neighborhood. Finishes in several hours with ~10 tokens for accounts that follow thousands of people.
  • 1e-7: visits somewhat beyond the 1-hop neighborhood (how far is still unclear). Takes a couple of days to run with ~10 tokens.
  • 1e-8: visits well beyond the 1-hop neighborhood, presumably reaching the important nodes in the 2-hop neighborhood (still unclear).

The more disparate a user's interests, and the less connected their neighborhood, the longer aPPR will take to run.
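For concreteness, a rough sketch of tuning epsilon, assuming alpha and epsilon are arguments to appr():

# larger epsilon finishes quickly but explores fewer nodes
quick_tracker <- appr(citation_graph, seeds = "5", epsilon = 1e-5)

# runtime scales roughly like 1 / (alpha * epsilon), so with alpha held
# fixed this run should take on the order of 10x longer than the one above
slow_tracker <- appr(citation_graph, seeds = "5", epsilon = 1e-6)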

Limitations

  • Connected graph assumption: what do results look like when we violate this assumption?
  • Sampling happens one node at a time

Speed ideas

Compute is not the bottleneck relative to actually getting the data.

Compute time ~ access-from-RAM time << access-from-disk time << access-from-network time.

Make requests to the API in bulk, memoize everything, cache / write to disk in a separate process?

General pattern: cache on disk, and also in RAM
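A rough sketch of that pattern using the memoise and cachem packages (cachem's layered cache stacking a memory cache over a disk cache); fetch_neighbors() here is a hypothetical stand-in for whatever function actually hits the API:

library(memoise)
library(cachem)

# layer an in-memory cache on top of a disk cache: repeated lookups hit
# RAM first, then disk, and only then go out to the network
node_cache <- cache_layered(
  cache_mem(max_size = 512 * 1024^2),
  cache_disk("appr-cache")
)

# hypothetical function that requests a node's neighbors from the API
fetch_neighbors <- function(node) {
  # ... expensive network request goes here ...
}

# memoising means each node is requested from the network at most once
fetch_neighbors <- memoise(fetch_neighbors, cache = node_cache)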

Working with Tracker objects

See ?Tracker for details.
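As a starting point, a rough sketch of pulling the highest-ranked nodes out of a Tracker, again assuming node-level results live in a stats tibble with a p column:

library(dplyr)

# the ten nodes with the largest approximated PageRank so far
citation_tracker$stats |>
  arrange(desc(p)) |>
  head(10)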
