[Research / Analysis] how to effectively embed quote tweets #1

Open
AbrahamSanders opened this issue Apr 11, 2020 · 0 comments

Any given tweet is either an original, a retweet, or a "quote tweet" (retweet with comments).

The challenge with quote tweets is that there are actually two linked tweets instead of a single tweet to embed. For example:

Quoted text: "I want to ride a Harley cross country."
Reply text:  "Me too!"

Clearly the reply text is meaningless without the context of the quoted text, so it doesn't make sense to embed it by itself.

There are two broad options for embedding quote tweets, the second of which has two variants, giving three embedding types in total. Let embed("text") be the embedding vector for some "text":

1. Embed a concatenation of the quoted and reply texts: embed('["I want to ride a Harley cross country."] Me too!')
2. Embed the quoted and reply texts separately and combine the vectors using weighted addition or component-wise max:

  • a. a * embed('I want to ride a Harley cross country.') + b * embed('Me too!')
    (where in the simplest case a=b=1)
  • b. max(embed('I want to ride a Harley cross country.'), embed('Me too!'))
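A minimal sketch of the three variants, for illustration only. The `embed` function here is a hypothetical stand-in (a hashed bag of words) so that the example runs without a model; in practice it would be replaced by whatever sentence encoder is actually used. The helper names `embed_concat`, `embed_weighted_sum`, and `embed_max` are also assumptions made for this example.

```python
import zlib

DIM = 16  # toy dimensionality (assumption); a real encoder defines its own


def embed(text):
    """Hypothetical stand-in for a sentence encoder: hashed bag of words."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        # crc32 is deterministic across runs, unlike the built-in hash()
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    return vec


def embed_concat(quoted, reply):
    # Option 1: embed a single string containing both texts.
    return embed(f'["{quoted}"] {reply}')


def embed_weighted_sum(quoted, reply, a=1.0, b=1.0):
    # Option 2a: a * embed(quoted) + b * embed(reply)
    return [a * q + b * r for q, r in zip(embed(quoted), embed(reply))]


def embed_max(quoted, reply):
    # Option 2b: component-wise max of the two vectors.
    return [max(q, r) for q, r in zip(embed(quoted), embed(reply))]


quoted = "I want to ride a Harley cross country."
reply = "Me too!"

variants = {
    "concat": embed_concat(quoted, reply),
    "weighted_sum": embed_weighted_sum(quoted, reply),
    "max": embed_max(quoted, reply),
}
for name, vector in variants.items():
    print(name, vector)
```

Note that with a=1 and b=0 the weighted sum reduces to embedding the quoted text alone, and with a=0 and b=1 to embedding the reply alone, so the weights control how much of the quoted context is carried into the final vector.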

In the absence of tweet pairs labeled with ground-truth semantic relatedness, these approaches cannot be easily compared quantitatively.

Open questions:

  1. How can these approaches best be compared qualitatively without launching a large-scale human study?

    • For example, how well do quote tweets embedded with each approach "fit" within their cluster and among their top-k nearest neighbors, judged on some agreed scale of appropriateness?
    • For example, how are groups of quote tweets with the same quoted text and different reply text distributed across the embedding space?
      • a. Are they clustered together?
      • b. Are they clustered with other tweets that share similar meaning with the reply text?
      • c. Can the weighted sum embedding approach strike a good balance between these?
  2. Can a task with weak supervision be engineered to substitute for hard supervision?

    • For example, how well does each embedding predict the term frequency distribution of its top-k nearest neighbors?
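The probes in the questions above could be prototyped roughly as follows. This is a sketch under assumptions, not a settled evaluation method: cosine similarity is assumed as the similarity measure, and `group_cohesion`, `top_k_neighbors`, and `term_distribution` are hypothetical helper names. The vectors in the example are hand-made 2-d toys, not real embeddings.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def group_cohesion(vectors):
    """Mean pairwise cosine similarity within a group, e.g. all quote tweets
    that share one quoted text. Higher means the group is more tightly clustered."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def top_k_neighbors(query, corpus, k):
    """corpus is a list of (text, vector) pairs; returns the k most similar texts."""
    ranked = sorted(corpus, key=lambda item: cosine(query, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def term_distribution(texts):
    """Relative term frequencies over whitespace tokens of the given texts."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


# Toy data (assumption): three tweets with hand-made 2-d vectors.
corpus = [
    ("harley road trip", [1.0, 0.0]),
    ("motorcycle road trip", [0.9, 0.1]),
    ("cat pictures", [0.0, 1.0]),
]
query_vector = [1.0, 0.05]

neighbors = top_k_neighbors(query_vector, corpus, k=2)
print(neighbors)
print(term_distribution(neighbors))
print(group_cohesion([vec for _, vec in corpus[:2]]))
```

Running `group_cohesion` once per embedding type over the same group of quote tweets gives a rough answer to whether they cluster together. The neighbor term distribution could serve as a weakly supervised target, though how to score an embedding against it is left open here.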

To facilitate analysis in this area I will be including embeddings of all three types (concatenation, weighted sum, and component-wise max) for each quote tweet in Elasticsearch.
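One way the index mapping might look, as a sketch only. It assumes Elasticsearch's `dense_vector` field type with one field per embedding type. The field names (`embedding_concat`, `embedding_weighted_sum`, `embedding_max`), the `text` field, and the dimensionality of 512 are all hypothetical; the real values depend on the encoder and on the index design.

```python
import json

EMBEDDING_DIMS = 512  # assumption: set this to the encoder's output size


def quote_tweet_mapping(dims=EMBEDDING_DIMS):
    """Build an index mapping with one dense_vector field per embedding type."""
    vector_field = {"type": "dense_vector", "dims": dims}
    return {
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                # Hypothetical field names, one per embedding variant
                "embedding_concat": dict(vector_field),
                "embedding_weighted_sum": dict(vector_field),
                "embedding_max": dict(vector_field),
            }
        }
    }


mapping = quote_tweet_mapping()
print(json.dumps(mapping, indent=2))
```

Storing the three vectors side by side on the same document would let the same nearest-neighbor query be repeated against each field and the results compared directly.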

@AbrahamSanders AbrahamSanders self-assigned this Apr 11, 2020
@AbrahamSanders AbrahamSanders added the research needs investigation or trial of one or more approaches label Apr 12, 2020