CV2-5373: Most relevant articles #2130

Merged
17 commits
33bd366
CV2-5373: add a new graphql fields &
melsawy Nov 18, 2024
cfc7370
CV2-5373: update relay files
melsawy Nov 19, 2024
cb53dfa
Merge branch 'develop' into epic/CV2-5373-most-recent-articles-to-mos…
melsawy Nov 20, 2024
464eb3e
Merge branch 'develop' into epic/CV2-5373-most-recent-articles-to-mos…
melsawy Nov 23, 2024
b22f022
CV2-5730: return dummy relevant articles (#2136)
melsawy Nov 23, 2024
58d0ade
Merge branch 'develop' into epic/CV2-5373-most-recent-articles-to-mos…
melsawy Nov 25, 2024
03d9074
Merge branch 'develop' into epic/CV2-5373-most-recent-articles-to-mos…
melsawy Nov 26, 2024
236367e
CV2-5731 Refactoring smooch search (#2137)
melsawy Nov 26, 2024
fe3eda9
Merge branch 'develop' into epic/CV2-5373-most-recent-articles-to-mos…
melsawy Nov 29, 2024
f87308d
Merge branch 'develop' into epic/CV2-5373-most-recent-articles-to-mos…
melsawy Dec 2, 2024
5c95273
CV2-5761 list most relevant articles fact check and explainer for pro…
melsawy Dec 3, 2024
d51b47b
CV2-5373: check language exists for fc_language condition
melsawy Dec 4, 2024
8ba8fe4
CV2-5617: use workspace similarity settings for explainers (#2145)
melsawy Dec 10, 2024
df905f7
Merge branch 'develop' into epic/CV2-5373-most-recent-articles-to-mos…
melsawy Dec 10, 2024
f58a856
Add Helper Method to Create and Publish Standalone Fact Checks for Ch…
danielevalverde Dec 10, 2024
5d9d26d
CV2-5373: fix tests
melsawy Dec 11, 2024
20785b7
Merge branch 'epic/CV2-5373-most-recent-articles-to-most-relevant-art…
melsawy Dec 11, 2024
33 changes: 33 additions & 0 deletions app/controllers/test_controller.rb
@@ -225,6 +225,39 @@ def suggest_similarity_item
render_success 'suggest_similarity', pm1
end

def create_imported_standalone_fact_check
team = Team.current = Team.find(params[:team_id])
user = User.where(email: params[:email]).last
description = params[:description]
context = params[:context]
title = params[:title]
summary = params[:summary]
url = params[:url]
language = params[:language] || 'en'

# Create ClaimDescription
claim_description = ClaimDescription.create!(
description: description,
context: context,
user: user,
team: team
)

# Set up FactCheck
fact_check = FactCheck.new(
claim_description: claim_description,
title: title,
summary: summary,
url: url,
language: language,
user: user,
publish_report: true,
report_status: 'published'
)
fact_check.save!
render_success 'fact_check', fact_check
end

def random
render html: "<!doctype html><html><head><title>Test #{rand(100000).to_i}</title></head><body>Test</body></html>".html_safe
end
12 changes: 12 additions & 0 deletions app/graph/types/project_media_type.rb
@@ -394,4 +394,16 @@ def articles_count
count += 1 if object.fact_check
count
end

field :relevant_articles, ::ArticleUnion.connection_type, null: true

def relevant_articles
object.get_similar_articles
end

field :relevant_articles_count, GraphQL::Types::Int, null: true

def relevant_articles_count
object.get_similar_articles.count
end
end
21 changes: 15 additions & 6 deletions app/models/bot/alegre.rb
@@ -256,11 +256,12 @@ def self.merge_suggested_and_confirmed(suggested_or_confirmed, confirmed, pm)
end
end

def self.get_matching_key_value(pm, media_type, similarity_method, automatic, model_name)
def self.get_threshold_given_model_settings(team_id, media_type, similarity_method, automatic, model_name)
tbi = nil
tbi = self.get_alegre_tbi(team_id) unless team_id.nil?
similarity_level = automatic ? 'matching' : 'suggestion'
generic_key = "#{media_type}_#{similarity_method}_#{similarity_level}_threshold"
specific_key = "#{media_type}_#{similarity_method}_#{model_name}_#{similarity_level}_threshold"
tbi = self.get_alegre_tbi(pm&.team_id)
settings = tbi.alegre_settings unless tbi.nil?
outkey = ""
value = nil
@@ -274,17 +275,25 @@ def self.get_matching_key_value(pm, media_type, similarity_method, automatic, mo
return [outkey, value]
end

def self.get_threshold_for_query(media_type, pm, automatic = false)
def self.get_matching_key_value(pm, media_type, similarity_method, automatic, model_name)
self.get_threshold_given_model_settings(pm&.team_id, media_type, similarity_method, automatic, model_name)
end

def self.get_similarity_methods_and_models_given_media_type_and_team_id(media_type, team_id, get_vector_settings)
similarity_methods = media_type == 'text' ? ['elasticsearch'] : ['hash']
models = similarity_methods.dup
if media_type == 'text' && !pm.nil?
models_to_use = [self.matching_model_to_use(pm.team_id)].flatten-[Bot::Alegre::ELASTICSEARCH_MODEL]
if media_type == 'text' && get_vector_settings
models_to_use = [self.matching_model_to_use(team_id)].flatten-[Bot::Alegre::ELASTICSEARCH_MODEL]
models_to_use.each do |model|
similarity_methods << 'vector'
models << model
end
end
similarity_methods.zip(models).collect do |similarity_method, model_name|
return similarity_methods.zip(models)
end

def self.get_threshold_for_query(media_type, pm, automatic = false)
self.get_similarity_methods_and_models_given_media_type_and_team_id(media_type, pm&.team_id, !pm.nil?).collect do |similarity_method, model_name|
key, value = self.get_matching_key_value(pm, media_type, similarity_method, automatic, model_name)
{ value: value.to_f, key: key, automatic: automatic, model: model_name}
end
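
The refactor above moves the method/model pairing out of `get_threshold_for_query` into its own method. The following is a minimal standalone sketch of that pairing logic, not the PR's code: `pair_methods_and_models` is a hypothetical name, the team's model list is passed in directly instead of being read through `matching_model_to_use(team_id)`, and the value of the Elasticsearch model constant is assumed.

```ruby
# Assumed value of Bot::Alegre::ELASTICSEARCH_MODEL, for illustration only.
ES_MODEL = 'elasticsearch'

# Hypothetical stand-in: team_models replaces the matching_model_to_use(team_id) lookup.
def pair_methods_and_models(media_type, team_models, get_vector_settings)
  # Text always gets an Elasticsearch pass; other media types use hashing.
  similarity_methods = media_type == 'text' ? ['elasticsearch'] : ['hash']
  models = similarity_methods.dup
  if media_type == 'text' && get_vector_settings
    # Every non-Elasticsearch model configured for the team adds one vector pass.
    ([team_models].flatten - [ES_MODEL]).each do |model|
      similarity_methods << 'vector'
      models << model
    end
  end
  similarity_methods.zip(models)
end

p pair_methods_and_models('text', ['elasticsearch', 'some-vector-model'], true)
p pair_methods_and_models('image', [], true)
```

Each returned pair is then handed to the threshold lookup, which is why `Explainer` can reuse the same pairing with a bare `team_id` instead of a `ProjectMedia`.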
45 changes: 23 additions & 22 deletions app/models/concerns/smooch_search.rb
@@ -9,21 +9,22 @@ module ClassMethods
def search(app_id, uid, language, message, team_id, workflow, provider = nil)
platform = self.get_platform_from_message(message)
begin
limit = CheckConfig.get('most_relevant_team_limit', 3, :integer)
sm = CheckStateMachine.new(uid)
self.get_installation(self.installation_setting_id_keys, app_id) if self.config.blank?
RequestStore.store[:smooch_bot_provider] = provider unless provider.blank?
query = self.get_search_query(uid, message)
results = self.get_search_results(uid, query, team_id, language).collect{ |pm| Relationship.confirmed_parent(pm) }.uniq
results = self.get_search_results(uid, query, team_id, language, limit).collect{ |pm| Relationship.confirmed_parent(pm) }.uniq
reports = results.select{ |pm| pm.report_status == 'published' }.collect{ |pm| pm.get_dynamic_annotation('report_design') }.reject{ |r| r.nil? }.collect{ |r| r.report_design_to_tipline_search_result }.select{ |r| r.should_send_in_language?(language) }

# Extract explainers from matched media if they don't have published fact-checks but they have explainers
reports = results.collect{ |pm| pm.explainers.to_a }.flatten.uniq.first(3).map(&:as_tipline_search_result) if !results.empty? && reports.empty?
reports = results.collect{ |pm| pm.explainers.to_a }.flatten.uniq.first(limit).map(&:as_tipline_search_result) if !results.empty? && reports.empty?

# Search for explainers if fact-checks were not found
if reports.empty? && query['type'] == 'text'
explainers = self.search_for_explainers(uid, query['text'], team_id, language).first(3).select{ |explainer| explainer.as_tipline_search_result.should_send_in_language?(language) }
explainers = self.search_for_explainers(uid, query['text'], team_id, limit, language).select{ |explainer| explainer.as_tipline_search_result.should_send_in_language?(language) }
Rails.logger.info "[Smooch Bot] Text similarity search got #{explainers.count} explainers while looking for '#{query['text']}' for team #{team_id}"
results = explainers.collect{ |explainer| explainer.project_medias.to_a }.flatten.uniq.reject{ |pm| pm.blank? }.first(3)
results = explainers.collect{ |explainer| explainer.project_medias.to_a }.flatten.uniq.reject{ |pm| pm.blank? }.first(limit)
reports = explainers.map(&:as_tipline_search_result)
end

@@ -100,9 +101,9 @@ def reject_temporary_results(results)
end
end

def parse_search_results_from_alegre(results, after = nil, feed_id = nil, team_ids = nil)
def parse_search_results_from_alegre(results, limit, after = nil, feed_id = nil, team_ids = nil)
pms = reject_temporary_results(results).sort_by{ |a| [a[1][:model] != Bot::Alegre::ELASTICSEARCH_MODEL ? 1 : 0, a[1][:score]] }.to_h.keys.reverse.collect{ |id| Relationship.confirmed_parent(ProjectMedia.find_by_id(id)) }
filter_search_results(pms, after, feed_id, team_ids).uniq(&:id).sort_by{ |pm| pm.report_status == 'published' ? 0 : 1 }.first(3)
filter_search_results(pms, after, feed_id, team_ids).uniq(&:id).first(limit)
end

def date_filter(team_id)
@@ -127,14 +128,14 @@ def get_search_query(uid, last_message)
self.bundle_list_of_messages(list, last_message, true)
end

def get_search_results(uid, message, team_id, language)
def get_search_results(uid, message, team_id, language, limit)
results = []
begin
type = message['type']
after = self.date_filter(team_id)
query = message['text']
query = CheckS3.rewrite_url(message['mediaUrl']) unless type == 'text'
results = self.search_for_similar_published_fact_checks(type, query, [team_id], after, nil, language).select{ |pm| is_a_valid_search_result(pm) }
results = self.search_for_similar_published_fact_checks(type, query, [team_id], limit, after, nil, language).select{ |pm| is_a_valid_search_result(pm) }
rescue StandardError => e
self.handle_search_error(uid, e, language)
end
@@ -148,19 +149,19 @@ def normalized_query_hash(type, query, team_ids, after, feed_id, language)

# "type" is text, video, audio or image
# "query" is either a piece of text or a media URL
def search_for_similar_published_fact_checks(type, query, team_ids, after = nil, feed_id = nil, language = nil, skip_cache = false)
def search_for_similar_published_fact_checks(type, query, team_ids, limit, after = nil, feed_id = nil, language = nil, skip_cache = false)
if skip_cache
self.search_for_similar_published_fact_checks_no_cache(type, query, team_ids, after, feed_id, language)
self.search_for_similar_published_fact_checks_no_cache(type, query, team_ids, limit, after, feed_id, language)
else
Rails.cache.fetch("smooch:search_results:#{self.normalized_query_hash(type, query, team_ids, after, feed_id, language)}", expires_in: 2.hours) do
self.search_for_similar_published_fact_checks_no_cache(type, query, team_ids, after, feed_id, language)
self.search_for_similar_published_fact_checks_no_cache(type, query, team_ids, limit, after, feed_id, language)
end
end
end

# "type" is text, video, audio or image
# "query" is either a piece of text or a media URL
def search_for_similar_published_fact_checks_no_cache(type, query, team_ids, after = nil, feed_id = nil, language = nil)
def search_for_similar_published_fact_checks_no_cache(type, query, team_ids, limit, after = nil, feed_id = nil, language = nil)
results = []
pm = nil
pm = ProjectMedia.new(team_id: team_ids[0]) if team_ids.size == 1 # We'll use the settings of a team instead of global settings when there is only one team
@@ -179,10 +180,10 @@ def search_for_similar_published_fact_checks_no_cache(type, query, team_ids, aft
words = text.split(/\s+/)
Rails.logger.info "[Smooch Bot] Search query (text): #{text}"
if Bot::Alegre.get_number_of_words(text) <= self.max_number_of_words_for_keyword_search
results = self.search_by_keywords_for_similar_published_fact_checks(words, after, team_ids, feed_id, language)
results = self.search_by_keywords_for_similar_published_fact_checks(words, after, team_ids, limit, feed_id, language)
else
alegre_results = Bot::Alegre.get_merged_similar_items(pm, [{ value: self.get_text_similarity_threshold }], Bot::Alegre::ALL_TEXT_SIMILARITY_FIELDS, text, team_ids)
results = self.parse_search_results_from_alegre(alegre_results, after, feed_id, team_ids)
results = self.parse_search_results_from_alegre(alegre_results, limit, after, feed_id, team_ids)
Rails.logger.info "[Smooch Bot] Text similarity search got #{results.count} results while looking for '#{text}' after date #{after.inspect} for teams #{team_ids}"
end
else
@@ -192,7 +193,7 @@ def search_for_similar_published_fact_checks_no_cache(type, query, team_ids, aft
media_url = self.save_locally_and_return_url(media_url, type, feed_id)
threshold = Bot::Alegre.get_threshold_for_query(type, pm)[0][:value]
alegre_results = Bot::Alegre.get_items_with_similar_media_v2(media_url: media_url, threshold: [{ value: threshold }], team_ids: team_ids, type: type)
results = self.parse_search_results_from_alegre(alegre_results, after, feed_id, team_ids)
results = self.parse_search_results_from_alegre(alegre_results, limit, after, feed_id, team_ids)
Rails.logger.info "[Smooch Bot] Media similarity search got #{results.count} results while looking for '#{query}' after date #{after.inspect} for teams #{team_ids}"
end
results
@@ -245,11 +246,11 @@ def should_restrict_by_language?(team_ids)
!!tbi&.alegre_settings&.dig('single_language_fact_checks_enabled')
end

def search_by_keywords_for_similar_published_fact_checks(words, after, team_ids, feed_id = nil, language = nil)
def search_by_keywords_for_similar_published_fact_checks(words, after, team_ids, limit, feed_id = nil, language = nil)
types = CheckSearch::MEDIA_TYPES.clone.push('blank')
search_fields = %w(title description fact_check_title fact_check_summary extracted_text url claim_description_content)
filters = { keyword: words.join('+'), keyword_fields: { fields: search_fields }, sort: 'recent_activity', eslimit: 3, show: types }
filters.merge!({ fc_language: [language] }) if should_restrict_by_language?(team_ids)
filters = { keyword: words.join('+'), keyword_fields: { fields: search_fields }, sort: 'recent_activity', eslimit: limit, show: types }
filters.merge!({ fc_language: [language] }) if !language.blank? && should_restrict_by_language?(team_ids)
filters.merge!({ sort: 'score' }) if words.size > 1 # We still want to be able to return the latest fact-checks if a meaningful query is not passed
feed_id.blank? ? filters.merge!({ report_status: ['published'] }) : filters.merge!({ feed_id: feed_id })
filters.merge!({ range: { updated_at: { start_time: after.strftime('%Y-%m-%dT%H:%M:%S.%LZ') } } }) unless after.blank?
@@ -304,19 +305,19 @@ def ask_for_feedback_when_all_search_results_are_received(app_id, language, work
end
end

def search_for_explainers(uid, query, team_id, language)
def search_for_explainers(uid, query, team_id, limit, language = nil)
results = nil
begin
text = ::Bot::Smooch.extract_claim(query)
if Bot::Alegre.get_number_of_words(text) == 1
results = Explainer.where(team_id: team_id).where('description ILIKE ? OR title ILIKE ?', "%#{text}%", "%#{text}%")
results = results.where(language: language) if should_restrict_by_language?([team_id])
results = results.where(language: language) if !language.nil? && should_restrict_by_language?([team_id])
results = results.order('updated_at DESC')
else
results = Explainer.search_by_similarity(text, language, team_id)
results = Explainer.search_by_similarity(text, language, team_id, limit)
end
rescue StandardError => e
self.handle_search_error(uid, e, language)
self.handle_search_error(uid, e, language) unless uid.blank?
end
results.joins(:project_medias)
end
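
In this file the hard-coded `first(3)` calls become a `limit` argument, and `parse_search_results_from_alegre` no longer re-sorts by report status before truncating. The ordering that remains can be sketched on plain hashes as below. This is an illustration only: `rank_result_ids` is a hypothetical name, the model constant's value is assumed, and the real method also rejects temporary results, resolves confirmed parents, and filters by date and feed.

```ruby
# Assumed value of Bot::Alegre::ELASTICSEARCH_MODEL, for illustration only.
ELASTICSEARCH_MODEL_NAME = 'elasticsearch'

# results maps an item ID to { model:, score: }, mimicking the Alegre response shape.
def rank_result_ids(results, limit)
  results
    .sort_by { |_id, r| [r[:model] != ELASTICSEARCH_MODEL_NAME ? 1 : 0, r[:score]] }
    .map(&:first) # keep only the IDs, in ascending order
    .reverse      # best first: vector-model hits ahead of Elasticsearch hits, then by score
    .first(limit)
end

results = {
  1 => { model: 'elasticsearch', score: 0.9 },
  2 => { model: 'some-vector-model', score: 0.8 },
  3 => { model: 'some-vector-model', score: 0.95 },
  4 => { model: 'elasticsearch', score: 0.5 }
}
p rank_result_ids(results, 3)
```

Because truncation happens last, raising `limit` widens the tail of the ranking without changing the order of the items that were already returned.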
41 changes: 24 additions & 17 deletions app/models/explainer.rb
@@ -1,12 +1,6 @@
class Explainer < ApplicationRecord
include Article

# FIXME: Read from workspace settings
ALEGRE_MODELS_AND_THRESHOLDS = {
# Bot::Alegre::ELASTICSEARCH_MODEL => 0.8 # Sometimes this is easier for local development
Bot::Alegre::PARAPHRASE_MULTILINGUAL_MODEL => 0.7
}

belongs_to :team

has_annotations
@@ -71,13 +65,14 @@ def self.update_paragraphs_in_alegre(id, previous_paragraphs_count, timestamp)
explainer_id: explainer.id
}

models_thresholds = Explainer.get_alegre_models_and_thresholds(explainer.team_id).keys
# Index title
params = {
content_hash: Bot::Alegre.content_hash_for_value(explainer.title),
doc_id: Digest::MD5.hexdigest(['explainer', explainer.id, 'title'].join(':')),
context: base_context.merge({ field: 'title' }),
text: explainer.title,
models: ALEGRE_MODELS_AND_THRESHOLDS.keys,
models: models_thresholds,
}
Bot::Alegre.index_async_with_params(params, "text")

@@ -90,7 +85,7 @@ def self.update_paragraphs_in_alegre(id, previous_paragraphs_count, timestamp)
doc_id: Digest::MD5.hexdigest(['explainer', explainer.id, 'paragraph', count].join(':')),
context: base_context.merge({ paragraph: count }),
text: paragraph.strip,
models: ALEGRE_MODELS_AND_THRESHOLDS.keys,
models: models_thresholds,
}
Bot::Alegre.index_async_with_params(params, "text")
end
@@ -107,23 +102,35 @@ def self.update_paragraphs_in_alegre(id, previous_paragraphs_count, timestamp)
end
end

def self.search_by_similarity(text, language, team_id)
def self.search_by_similarity(text, language, team_id, limit)
models_thresholds = Explainer.get_alegre_models_and_thresholds(team_id)
context = {
type: 'explainer',
team: Team.find(team_id).slug
}
context[:language] = language unless language.nil?
params = {
text: text,
models: ALEGRE_MODELS_AND_THRESHOLDS.keys,
per_model_threshold: ALEGRE_MODELS_AND_THRESHOLDS,
context: {
type: 'explainer',
team: Team.find(team_id).slug,
language: language
}
models: models_thresholds.keys,
per_model_threshold: models_thresholds,
context: context

}
response = Bot::Alegre.query_sync_with_params(params, "text")
results = response['result'].to_a.sort_by{ |result| result['_score'] }
explainer_ids = results.collect{ |result| result.dig('context', 'explainer_id').to_i }.uniq.first(3)
explainer_ids = results.collect{ |result| result.dig('context', 'explainer_id').to_i }.uniq.first(limit)
explainer_ids.empty? ? Explainer.none : Explainer.where(team_id: team_id, id: explainer_ids)
end

def self.get_alegre_models_and_thresholds(team_id)
models_thresholds = {}
Bot::Alegre.get_similarity_methods_and_models_given_media_type_and_team_id("text", team_id, true).map do |similarity_method, model_name|
_, value = Bot::Alegre.get_threshold_given_model_settings(team_id, "text", similarity_method, true, model_name)
models_thresholds[model_name] = value
end
models_thresholds
end
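
`get_alegre_models_and_thresholds` replaces the removed `ALEGRE_MODELS_AND_THRESHOLDS` constant with a hash built from workspace settings. A rough sketch of that idea on a plain settings hash follows. The helper names and the sample settings are hypothetical, and the rule that a model-specific key wins over the generic key is an assumption, since the lookup body is collapsed in the diff above.

```ruby
# Hypothetical lookup: prefers the model-specific key, falls back to the generic one (assumed order).
def threshold_for(settings, media_type, similarity_method, automatic, model_name)
  level = automatic ? 'matching' : 'suggestion'
  generic_key = "#{media_type}_#{similarity_method}_#{level}_threshold"
  specific_key = "#{media_type}_#{similarity_method}_#{model_name}_#{level}_threshold"
  key = settings.key?(specific_key) ? specific_key : generic_key
  [key, settings[key]]
end

# Builds { model_name => threshold } from [similarity_method, model_name] pairs.
def models_and_thresholds(settings, pairs)
  pairs.each_with_object({}) do |(similarity_method, model_name), acc|
    _, value = threshold_for(settings, 'text', similarity_method, true, model_name)
    acc[model_name] = value
  end
end

# Made-up workspace settings, for illustration only.
settings = {
  'text_elasticsearch_matching_threshold' => 0.9,
  'text_vector_some-vector-model_matching_threshold' => 0.75
}
pairs = [['elasticsearch', 'elasticsearch'], ['vector', 'some-vector-model']]
p models_and_thresholds(settings, pairs)
```

The resulting hash serves both purposes seen in the diff: its keys are the `models` to index or query, and the hash itself is the `per_model_threshold`.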

private

def set_team
16 changes: 16 additions & 0 deletions app/models/project_media.rb
@@ -455,6 +455,22 @@ def replace_with_blank_media
self.save!
end

def get_similar_articles
# Get search query based on Media type
# Quote for Claim
# Transcription for UploadedVideo , UploadedAudio and UploadedImage
# Title and/or description for Link
media = self.media
search_query = case media.type
when 'Claim'
media.quote
when 'UploadedVideo', 'UploadedAudio', 'UploadedImage'
self.transcription
end
search_query ||= self.title
self.team.search_for_similar_articles(search_query, self)
end
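
The query selection in `get_similar_articles` can be exercised in isolation. The sketch below uses hypothetical `Struct` stand-ins for `Media` and `ProjectMedia` rather than the real models, and shows the fallback to the title whenever the type-specific text is missing.

```ruby
# Hypothetical stand-ins for the ActiveRecord models.
FakeMedia = Struct.new(:type, :quote)
FakeItem = Struct.new(:media, :transcription, :title)

def search_query_for(item)
  query = case item.media.type
          when 'Claim'
            item.media.quote
          when 'UploadedVideo', 'UploadedAudio', 'UploadedImage'
            item.transcription
          end
  # Links, and any item whose quote or transcription is nil, fall back to the title.
  query || item.title
end

claim = FakeItem.new(FakeMedia.new('Claim', 'The sky is green'), nil, 'Claim title')
audio = FakeItem.new(FakeMedia.new('UploadedAudio', nil), nil, 'Audio title')
link  = FakeItem.new(FakeMedia.new('Link', nil), nil, 'Link title')
p search_query_for(claim), search_query_for(audio), search_query_for(link)
```

Note that the fallback triggers on `nil` only, so an empty-string transcription would be used as the query as-is.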

protected

def add_extra_elasticsearch_data(ms)
24 changes: 24 additions & 0 deletions app/models/team.rb
@@ -563,6 +563,30 @@ def filter_by_keywords(query, filters, type = 'FactCheck')
query.where(Arel.sql("#{tsvector} @@ #{tsquery}"))
end

def search_for_similar_articles(query, pm = nil)
# query: expected to be text
# pm: when given, request the articles most relevant to that specific item and include both FactCheck and Explainer
limit = pm.nil? ? CheckConfig.get('most_relevant_team_limit', 3, :integer) : CheckConfig.get('most_relevant_item_limit', 10, :integer)
result_ids = Bot::Smooch.search_for_similar_published_fact_checks_no_cache('text', query, [self.id], limit).map(&:id)
items = []
unless result_ids.blank?
# Filter results through FactCheck instead of report_design
items = FactCheck.where(report_status: 'published')
.joins(claim_description: :project_media)
.where('project_medias.id': result_ids)
# Exclude the ones already applied to the target item, if it exists
items = items.where.not('fact_checks.id' => pm.fact_check_id) unless pm&.fact_check_id.nil?
end
if items.blank? || !pm.nil?
# Get Explainers if no fact-check returned or get similar_articles for a ProjectMedia
ex_items = Bot::Smooch.search_for_explainers(nil, query, self.id, limit)
# Exclude the ones already applied to a target item
ex_items = ex_items.where.not(id: pm.explainer_ids) unless pm&.explainer_ids.blank?
items = items + ex_items
end
items
end
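
The merge rules in `search_for_similar_articles` can be summarized with plain arrays: explainers are added when no fact-check matched or when the search is for a specific item, and anything already attached to that item is excluded. The sketch below is an approximation under stated assumptions: the helper names are hypothetical, articles and the item are modelled as hashes, and a plain hash stands in for `CheckConfig`.

```ruby
# Hypothetical config lookup; the defaults mirror the ones in the diff (3 per workspace, 10 per item).
def relevant_limit(for_item, config = {})
  for_item ? config.fetch('most_relevant_item_limit', 10) : config.fetch('most_relevant_team_limit', 3)
end

# item is nil for a workspace-level search, or { fact_check_id:, explainer_ids: } for an item-level one.
def merge_articles(fact_checks, explainers, item: nil)
  items = fact_checks
  unless item.nil? || item[:fact_check_id].nil?
    items = items.reject { |fc| fc[:id] == item[:fact_check_id] }
  end
  if items.empty? || !item.nil?
    applied = item.nil? ? [] : item.fetch(:explainer_ids, [])
    items += explainers.reject { |ex| applied.include?(ex[:id]) }
  end
  items
end

fact_checks = [{ id: 1, type: 'FactCheck' }, { id: 2, type: 'FactCheck' }]
explainers = [{ id: 10, type: 'Explainer' }, { id: 11, type: 'Explainer' }]
p merge_articles(fact_checks, explainers)
p merge_articles(fact_checks, explainers, item: { fact_check_id: 1, explainer_ids: [10] })
```

In the workspace-level case explainers act purely as a fallback, while the item-level case always returns both article types.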

# private
#
# Please add private methods to app/models/concerns/team_private.rb