Any given tweet is either an original, a retweet, or a "quote tweet" (retweet with comments).
The challenge with quote tweets is that there are actually two linked tweets instead of a single tweet to embed. For example:
Quoted text: "I want to ride a Harley cross country."
Reply text: "Me too!"
Clearly the reply text is meaningless without the context of the quoted text, so it doesn't make sense to embed it by itself.
There are two options for embedding quote tweets. Let embed("text") be the embedding vector for some "text":
1. Embed a concatenation of the quoted and reply texts:
   embed('["I want to ride a Harley cross country."] Me too!')
2. Embed the quoted and reply texts separately and combine the vectors using weighted addition or component-wise max:
a. a * embed('I want to ride a Harley cross country.') + b * embed('Me too!')
(where in the simplest case a=b=1)
b. max(embed('I want to ride a Harley cross country.'), embed('Me too!'))
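As a minimal, runnable sketch of the variants above (the embed() here is a deterministic toy stand-in for a real sentence encoder, and the dimensionality is arbitrary):

```python
import hashlib

import numpy as np


def embed(text: str, dim: int = 8) -> np.ndarray:
    """Toy stand-in for a real sentence encoder: a deterministic
    pseudo-random vector seeded by a hash of the text, just so the
    combination logic below runs end to end."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    return np.random.default_rng(seed).standard_normal(dim)


quoted = "I want to ride a Harley cross country."
reply = "Me too!"

# Option 1: embed the concatenation of the quoted and reply texts.
vec_concat = embed(f'["{quoted}"] {reply}')

# Option 2a: weighted addition of the separate embeddings (a = b = 1
# in the simplest case).
a, b = 1.0, 1.0
vec_sum = a * embed(quoted) + b * embed(reply)

# Option 2b: component-wise max of the separate embeddings.
vec_max = np.maximum(embed(quoted), embed(reply))
```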
In the absence of tweet pairs labeled with ground-truth semantic relatedness, these approaches cannot be easily compared quantitatively.
Open questions:
How to best compare these approaches qualitatively without launching a large-scale human study?
For example, how well do quote tweets embedded with each approach "fit" within their cluster and among their top-k nearest neighbors when judged for appropriateness by inspection?
For example, how are groups of quote tweets that share the same quoted text but have different reply texts distributed throughout the embedding space?
a. Are they clustered together?
b. Are they clustered with other tweets that share similar meaning with the reply text?
c. Can the weighted sum embedding approach strike a good balance between these?
Can a task with weak supervision be engineered to substitute for hard supervision?
For example, how well does each embedding predict the term-frequency distribution of its top-k nearest neighbors? (One crude version of this check is sketched below.)
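One way the term-frequency question could be operationalized (whitespace tokenization and the overlap score are simplistic assumptions on my part, not a settled metric):

```python
from collections import Counter

import numpy as np


def top_k_neighbors(query_vec: np.ndarray, corpus_vecs: np.ndarray, k: int = 10) -> np.ndarray:
    """Indices of the k corpus vectors most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    return np.argsort(-(c @ q))[:k]


def neighbor_term_overlap(query_text: str, neighbor_texts: list[str]) -> float:
    """Average fraction of the query's terms that reappear in each neighbor:
    a crude, label-free proxy for whether an embedding places a quote tweet
    near topically similar tweets."""
    query_terms = set(query_text.lower().split())
    if not query_terms or not neighbor_texts:
        return 0.0
    hits = Counter(t for text in neighbor_texts for t in set(text.lower().split()))
    return sum(hits[t] for t in query_terms) / (len(query_terms) * len(neighbor_texts))
```

A consistently higher score for one embedding variant, averaged over many quote tweets, would be weak evidence that it groups tweets more coherently.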
To facilitate analysis in this area, I will include all three embedding variants for each quote tweet in Elasticsearch.
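For reference, a sketch of the index layout I have in mind, assuming the 8.x elasticsearch-py client (the index and field names are placeholders, and the dims value must match the encoder's output size):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# One dense_vector field per embedding variant, so the variants can be
# compared side by side at query time.
es.indices.create(
    index="quote_tweets",
    mappings={
        "properties": {
            "quoted_text": {"type": "text"},
            "reply_text": {"type": "text"},
            "vec_concat": {"type": "dense_vector", "dims": 8},
            "vec_sum": {"type": "dense_vector", "dims": 8},
            "vec_max": {"type": "dense_vector", "dims": 8},
        }
    },
)

# `quoted`, `reply`, and the vec_* arrays are from the embedding sketch above.
es.index(
    index="quote_tweets",
    document={
        "quoted_text": quoted,
        "reply_text": reply,
        "vec_concat": vec_concat.tolist(),
        "vec_sum": vec_sum.tolist(),
        "vec_max": vec_max.tolist(),
    },
)
```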