Any given tweet is either an original, a retweet, or a "quote tweet" (retweet with comments).
The challenge with quote tweets is that there are actually two linked tweets instead of a single tweet to embed. For example:
Quoted text: "I want to ride a Harley cross country."
Reply text: "Me too!"
Clearly the reply text is meaningless without the context of the quoted text, so it doesn't make sense to embed it by itself.
There are two options for embedding quote tweets. Let embed("text") be the embedding vector for some "text":
1. Embed a concatenation of the quoted and reply texts:
   embed('["I want to ride a Harley cross country."] Me too!')
2. Embed the quoted and reply texts separately and combine the vectors using weighted addition or component-wise max:
a. a * embed('I want to ride a Harley cross country.') + b * embed('Me too!')
(where in the simplest case a=b=1)
b. max(embed('I want to ride a Harley cross country.'), embed('Me too!'))
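As a minimal, runnable sketch of the variants above (the embed() here is a deterministic toy stand-in for a real sentence encoder, and the dimensionality is arbitrary):

```python
import hashlib

import numpy as np


def embed(text: str, dim: int = 8) -> np.ndarray:
    """Toy stand-in for a real sentence encoder: a deterministic
    pseudo-random vector seeded by a hash of the text, just so the
    combination logic below runs end to end."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    return np.random.default_rng(seed).standard_normal(dim)


quoted = "I want to ride a Harley cross country."
reply = "Me too!"

# Option 1: embed the concatenation of the quoted and reply texts.
vec_concat = embed(f'["{quoted}"] {reply}')

# Option 2a: weighted addition of the separate embeddings (a = b = 1
# in the simplest case).
a, b = 1.0, 1.0
vec_sum = a * embed(quoted) + b * embed(reply)

# Option 2b: component-wise max of the separate embeddings.
vec_max = np.maximum(embed(quoted), embed(reply))
```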
In the absence of tweet pairs labeled with ground-truth semantic relatedness, these approaches cannot be easily compared quantitatively.
Open questions:
How to best compare these approaches qualitatively without launching a large-scale human study?
For example, how well do quote tweets embedded with each approach "fit" within their cluster and among their top-k nearest neighbors when judged for appropriateness by inspection?
For example, how are groups of quote tweets that share the same quoted text but have different reply texts distributed throughout the embedding space?
a. Are they clustered together?
b. Are they clustered with other tweets that share similar meaning with the reply text?
c. Can the weighted sum embedding approach strike a good balance between these?
Can a task with weak supervision be engineered to substitute for hard supervision?
For example, how well does each embedding predict the term-frequency distribution of its top-k nearest neighbors? (One crude version of this check is sketched below.)
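One way the term-frequency question could be operationalized (whitespace tokenization and the overlap score are simplistic assumptions on my part, not a settled metric):

```python
from collections import Counter

import numpy as np


def top_k_neighbors(query_vec: np.ndarray, corpus_vecs: np.ndarray, k: int = 10) -> np.ndarray:
    """Indices of the k corpus vectors most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    return np.argsort(-(c @ q))[:k]


def neighbor_term_overlap(query_text: str, neighbor_texts: list[str]) -> float:
    """Average fraction of the query's terms that reappear in each neighbor:
    a crude, label-free proxy for whether an embedding places a quote tweet
    near topically similar tweets."""
    query_terms = set(query_text.lower().split())
    if not query_terms or not neighbor_texts:
        return 0.0
    hits = Counter(t for text in neighbor_texts for t in set(text.lower().split()))
    return sum(hits[t] for t in query_terms) / (len(query_terms) * len(neighbor_texts))
```

A consistently higher score for one embedding variant, averaged over many quote tweets, would be weak evidence that it groups tweets more coherently.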
To facilitate analysis in this area, I will include all three embedding variants for each quote tweet in Elasticsearch.
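For reference, a sketch of the index layout I have in mind, assuming the 8.x elasticsearch-py client (the index and field names are placeholders, and the dims value must match the encoder's output size):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# One dense_vector field per embedding variant, so the variants can be
# compared side by side at query time.
es.indices.create(
    index="quote_tweets",
    mappings={
        "properties": {
            "quoted_text": {"type": "text"},
            "reply_text": {"type": "text"},
            "vec_concat": {"type": "dense_vector", "dims": 8},
            "vec_sum": {"type": "dense_vector", "dims": 8},
            "vec_max": {"type": "dense_vector", "dims": 8},
        }
    },
)

# `quoted`, `reply`, and the vec_* arrays are from the embedding sketch above.
es.index(
    index="quote_tweets",
    document={
        "quoted_text": quoted,
        "reply_text": reply,
        "vec_concat": vec_concat.tolist(),
        "vec_sum": vec_sum.tolist(),
        "vec_max": vec_max.tolist(),
    },
)
```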