CV2-5373: Most relevant articles (#2130)
* CV2-5373: add a new graphql fields  &

* CV2-5373: update relay files

* CV2-5730: return dummy relevant articles (#2136)

* CV2-5731 Refactoring smooch search (#2137)

* CV2-5731: call tipline search_for_articles method and append Explainers if no articles exist

* CV2-5731: fix graphql query and add more tests

* CV2-5731: apply PR comments and add more tests

* CV2-5761 list most relevant articles fact check and explainer for project media item (#2142)

* CV2-5761: include FactCheck & Explainer for item most relevant

* CV2-5751: fix articles sort and change the limit

* CV2-5761: keep default sort (sort by score)

* CV2-5761: enforce limit value as a method args

* CV2-5761: cleanup

* CV2-5761: add more tests

* CV2-5761: add missing test to back coverage 100%

* CV2-5761: apply PR comments

* CV2-5373: check language exists for fc_language condition

* CV2-5617: use workspace similarity settings for explainers (#2145)

* CV2-5617: use workspace similarity settings for explainers

* Refactor threshold getters

* CV2-5617: fix tests

* CV2-5617: return models_and_thresholds in Hash format

---------

Co-authored-by: Devin Gaffney <[email protected]>

* Add Helper Method to Create and Publish Standalone Fact Checks for Check Web Testing (#2151)

* Add new test helper method for creating standalone and published fact check

- create_imported_standalone_fact_check method in TestController to create a standalone and published fact check and associate it with a team.
- Updated `routes.rb` to include the new endpoint for fact checks.
- Added tests for the create_imported_standalone_fact_check

Reference: CV2-5737

* CV2-5373: fix tests

---------

Co-authored-by: Devin Gaffney <[email protected]>
Co-authored-by: Daniele Valverde <[email protected]>
3 people authored Dec 11, 2024
1 parent 1f4672f commit 5020cbc
Showing 21 changed files with 439 additions and 84 deletions.
33 changes: 33 additions & 0 deletions app/controllers/test_controller.rb
@@ -225,6 +225,39 @@ def suggest_similarity_item
render_success 'suggest_similarity', pm1
end

def create_imported_standalone_fact_check
team = Team.current = Team.find(params[:team_id])
user = User.where(email: params[:email]).last
description = params[:description]
context = params[:context]
title = params[:title]
summary = params[:summary]
url = params[:url]
language = params[:language] || 'en'

# Create ClaimDescription
claim_description = ClaimDescription.create!(
description: description,
context: context,
user: user,
team: team
)

# Set up FactCheck
fact_check = FactCheck.new(
claim_description: claim_description,
title: title,
summary: summary,
url: url,
language: language,
user: user,
publish_report: true,
report_status: 'published'
)
fact_check.save!
render_success 'fact_check', fact_check
end

def random
render html: "<!doctype html><html><head><title>Test #{rand(100000).to_i}</title></head><body>Test</body></html>".html_safe
end
12 changes: 12 additions & 0 deletions app/graph/types/project_media_type.rb
@@ -394,4 +394,16 @@ def articles_count
count += 1 if object.fact_check
count
end

field :relevant_articles, ::ArticleUnion.connection_type, null: true

def relevant_articles
object.get_similar_articles
end

field :relevant_articles_count, GraphQL::Types::Int, null: true

def relevant_articles_count
object.get_similar_articles.count
end
end
21 changes: 15 additions & 6 deletions app/models/bot/alegre.rb
@@ -256,11 +256,12 @@ def self.merge_suggested_and_confirmed(suggested_or_confirmed, confirmed, pm)
end
end

def self.get_matching_key_value(pm, media_type, similarity_method, automatic, model_name)
def self.get_threshold_given_model_settings(team_id, media_type, similarity_method, automatic, model_name)
tbi = nil
tbi = self.get_alegre_tbi(team_id) unless team_id.nil?
similarity_level = automatic ? 'matching' : 'suggestion'
generic_key = "#{media_type}_#{similarity_method}_#{similarity_level}_threshold"
specific_key = "#{media_type}_#{similarity_method}_#{model_name}_#{similarity_level}_threshold"
tbi = self.get_alegre_tbi(pm&.team_id)
settings = tbi.alegre_settings unless tbi.nil?
outkey = ""
value = nil
@@ -274,17 +275,25 @@ def self.get_matching_key_value(pm, media_type, similarity_method, automatic, mo
return [outkey, value]
end

def self.get_threshold_for_query(media_type, pm, automatic = false)
def self.get_matching_key_value(pm, media_type, similarity_method, automatic, model_name)
self.get_threshold_given_model_settings(pm&.team_id, media_type, similarity_method, automatic, model_name)
end

def self.get_similarity_methods_and_models_given_media_type_and_team_id(media_type, team_id, get_vector_settings)
similarity_methods = media_type == 'text' ? ['elasticsearch'] : ['hash']
models = similarity_methods.dup
if media_type == 'text' && !pm.nil?
models_to_use = [self.matching_model_to_use(pm.team_id)].flatten-[Bot::Alegre::ELASTICSEARCH_MODEL]
if media_type == 'text' && get_vector_settings
models_to_use = [self.matching_model_to_use(team_id)].flatten-[Bot::Alegre::ELASTICSEARCH_MODEL]
models_to_use.each do |model|
similarity_methods << 'vector'
models << model
end
end
similarity_methods.zip(models).collect do |similarity_method, model_name|
return similarity_methods.zip(models)
end

def self.get_threshold_for_query(media_type, pm, automatic = false)
self.get_similarity_methods_and_models_given_media_type_and_team_id(media_type, pm&.team_id, !pm.nil?).collect do |similarity_method, model_name|
key, value = self.get_matching_key_value(pm, media_type, similarity_method, automatic, model_name)
{ value: value.to_f, key: key, automatic: automatic, model: model_name}
end
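
The threshold lookup refactored above builds a generic and a model-specific settings key. The part of the method that chooses between them is collapsed in this diff, so the following is only a minimal sketch under one assumption: the model-specific key is preferred and the generic key is the fallback. `threshold_keys` and `resolve_threshold` are hypothetical names, and `settings` stands in for `tbi.alegre_settings`.

```ruby
# Hypothetical stand-alone sketch; not the Check API implementation.
# Builds the two candidate settings keys the same way the diff does.
def threshold_keys(media_type, similarity_method, automatic, model_name)
  level = automatic ? 'matching' : 'suggestion'
  {
    generic: "#{media_type}_#{similarity_method}_#{level}_threshold",
    specific: "#{media_type}_#{similarity_method}_#{model_name}_#{level}_threshold"
  }
end

# ASSUMPTION: the model-specific key wins over the generic one.
# The real precedence lives in the collapsed part of the hunk.
def resolve_threshold(settings, media_type, similarity_method, automatic, model_name)
  keys = threshold_keys(media_type, similarity_method, automatic, model_name)
  settings ||= {}
  [keys[:specific], keys[:generic]].each do |key|
    return [key, settings[key]] unless settings[key].nil?
  end
  ['', nil] # mirrors the initial outkey = "" and value = nil
end
```

Passing `nil` for `settings` mirrors the case where no team bot installation is found, which is why the sketch returns an empty key and a `nil` value there.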
45 changes: 23 additions & 22 deletions app/models/concerns/smooch_search.rb
@@ -9,21 +9,22 @@ module ClassMethods
def search(app_id, uid, language, message, team_id, workflow, provider = nil)
platform = self.get_platform_from_message(message)
begin
limit = CheckConfig.get('most_relevant_team_limit', 3, :integer)
sm = CheckStateMachine.new(uid)
self.get_installation(self.installation_setting_id_keys, app_id) if self.config.blank?
RequestStore.store[:smooch_bot_provider] = provider unless provider.blank?
query = self.get_search_query(uid, message)
results = self.get_search_results(uid, query, team_id, language).collect{ |pm| Relationship.confirmed_parent(pm) }.uniq
results = self.get_search_results(uid, query, team_id, language, limit).collect{ |pm| Relationship.confirmed_parent(pm) }.uniq
reports = results.select{ |pm| pm.report_status == 'published' }.collect{ |pm| pm.get_dynamic_annotation('report_design') }.reject{ |r| r.nil? }.collect{ |r| r.report_design_to_tipline_search_result }.select{ |r| r.should_send_in_language?(language) }

# Extract explainers from matched media if they don't have published fact-checks but they have explainers
reports = results.collect{ |pm| pm.explainers.to_a }.flatten.uniq.first(3).map(&:as_tipline_search_result) if !results.empty? && reports.empty?
reports = results.collect{ |pm| pm.explainers.to_a }.flatten.uniq.first(limit).map(&:as_tipline_search_result) if !results.empty? && reports.empty?

# Search for explainers if fact-checks were not found
if reports.empty? && query['type'] == 'text'
explainers = self.search_for_explainers(uid, query['text'], team_id, language).first(3).select{ |explainer| explainer.as_tipline_search_result.should_send_in_language?(language) }
explainers = self.search_for_explainers(uid, query['text'], team_id, limit, language).select{ |explainer| explainer.as_tipline_search_result.should_send_in_language?(language) }
Rails.logger.info "[Smooch Bot] Text similarity search got #{explainers.count} explainers while looking for '#{query['text']}' for team #{team_id}"
results = explainers.collect{ |explainer| explainer.project_medias.to_a }.flatten.uniq.reject{ |pm| pm.blank? }.first(3)
results = explainers.collect{ |explainer| explainer.project_medias.to_a }.flatten.uniq.reject{ |pm| pm.blank? }.first(limit)
reports = explainers.map(&:as_tipline_search_result)
end

@@ -100,9 +101,9 @@ def reject_temporary_results(results)
end
end

def parse_search_results_from_alegre(results, after = nil, feed_id = nil, team_ids = nil)
def parse_search_results_from_alegre(results, limit, after = nil, feed_id = nil, team_ids = nil)
pms = reject_temporary_results(results).sort_by{ |a| [a[1][:model] != Bot::Alegre::ELASTICSEARCH_MODEL ? 1 : 0, a[1][:score]] }.to_h.keys.reverse.collect{ |id| Relationship.confirmed_parent(ProjectMedia.find_by_id(id)) }
filter_search_results(pms, after, feed_id, team_ids).uniq(&:id).sort_by{ |pm| pm.report_status == 'published' ? 0 : 1 }.first(3)
filter_search_results(pms, after, feed_id, team_ids).uniq(&:id).first(limit)
end
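
The ordering applied in `parse_search_results_from_alegre` can be illustrated in isolation. This is a rough sketch that keeps only the sort, reverse and limit steps and leaves out the parent lookup and filtering; `rank_ids` is a hypothetical helper and the value of `ELASTICSEARCH_MODEL` is assumed here.

```ruby
# Hypothetical stand-alone sketch of the ranking step only.
ELASTICSEARCH_MODEL = 'elasticsearch' # assumed value of Bot::Alegre::ELASTICSEARCH_MODEL

# results maps an item id to { model:, score: }.
# Sorting ascending by [vector flag, score] and then reversing puts
# non-Elasticsearch matches first, each group by descending score.
def rank_ids(results, limit)
  results
    .sort_by { |_id, info| [info[:model] != ELASTICSEARCH_MODEL ? 1 : 0, info[:score]] }
    .map(&:first)
    .reverse
    .first(limit)
end
```

With the `limit` argument introduced in this commit, the caller decides how many ids survive instead of the previous hard-coded three.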

def date_filter(team_id)
@@ -127,14 +128,14 @@ def get_search_query(uid, last_message)
self.bundle_list_of_messages(list, last_message, true)
end

def get_search_results(uid, message, team_id, language)
def get_search_results(uid, message, team_id, language, limit)
results = []
begin
type = message['type']
after = self.date_filter(team_id)
query = message['text']
query = CheckS3.rewrite_url(message['mediaUrl']) unless type == 'text'
results = self.search_for_similar_published_fact_checks(type, query, [team_id], after, nil, language).select{ |pm| is_a_valid_search_result(pm) }
results = self.search_for_similar_published_fact_checks(type, query, [team_id], limit, after, nil, language).select{ |pm| is_a_valid_search_result(pm) }
rescue StandardError => e
self.handle_search_error(uid, e, language)
end
@@ -148,19 +149,19 @@ def normalized_query_hash(type, query, team_ids, after, feed_id, language)

# "type" is text, video, audio or image
# "query" is either a piece of text or a media URL
def search_for_similar_published_fact_checks(type, query, team_ids, after = nil, feed_id = nil, language = nil, skip_cache = false)
def search_for_similar_published_fact_checks(type, query, team_ids, limit, after = nil, feed_id = nil, language = nil, skip_cache = false)
if skip_cache
self.search_for_similar_published_fact_checks_no_cache(type, query, team_ids, after, feed_id, language)
self.search_for_similar_published_fact_checks_no_cache(type, query, team_ids, limit, after, feed_id, language)
else
Rails.cache.fetch("smooch:search_results:#{self.normalized_query_hash(type, query, team_ids, after, feed_id, language)}", expires_in: 2.hours) do
self.search_for_similar_published_fact_checks_no_cache(type, query, team_ids, after, feed_id, language)
self.search_for_similar_published_fact_checks_no_cache(type, query, team_ids, limit, after, feed_id, language)
end
end
end

# "type" is text, video, audio or image
# "query" is either a piece of text or a media URL
def search_for_similar_published_fact_checks_no_cache(type, query, team_ids, after = nil, feed_id = nil, language = nil)
def search_for_similar_published_fact_checks_no_cache(type, query, team_ids, limit, after = nil, feed_id = nil, language = nil)
results = []
pm = nil
pm = ProjectMedia.new(team_id: team_ids[0]) if team_ids.size == 1 # We'll use the settings of a team instead of global settings when there is only one team
@@ -179,10 +180,10 @@ def search_for_similar_published_fact_checks_no_cache(type, query, team_ids, aft
words = text.split(/\s+/)
Rails.logger.info "[Smooch Bot] Search query (text): #{text}"
if Bot::Alegre.get_number_of_words(text) <= self.max_number_of_words_for_keyword_search
results = self.search_by_keywords_for_similar_published_fact_checks(words, after, team_ids, feed_id, language)
results = self.search_by_keywords_for_similar_published_fact_checks(words, after, team_ids, limit, feed_id, language)
else
alegre_results = Bot::Alegre.get_merged_similar_items(pm, [{ value: self.get_text_similarity_threshold }], Bot::Alegre::ALL_TEXT_SIMILARITY_FIELDS, text, team_ids)
results = self.parse_search_results_from_alegre(alegre_results, after, feed_id, team_ids)
results = self.parse_search_results_from_alegre(alegre_results, limit, after, feed_id, team_ids)
Rails.logger.info "[Smooch Bot] Text similarity search got #{results.count} results while looking for '#{text}' after date #{after.inspect} for teams #{team_ids}"
end
else
@@ -192,7 +193,7 @@ def search_for_similar_published_fact_checks_no_cache(type, query, team_ids, aft
media_url = self.save_locally_and_return_url(media_url, type, feed_id)
threshold = Bot::Alegre.get_threshold_for_query(type, pm)[0][:value]
alegre_results = Bot::Alegre.get_items_with_similar_media_v2(media_url: media_url, threshold: [{ value: threshold }], team_ids: team_ids, type: type)
results = self.parse_search_results_from_alegre(alegre_results, after, feed_id, team_ids)
results = self.parse_search_results_from_alegre(alegre_results, limit, after, feed_id, team_ids)
Rails.logger.info "[Smooch Bot] Media similarity search got #{results.count} results while looking for '#{query}' after date #{after.inspect} for teams #{team_ids}"
end
results
@@ -245,11 +246,11 @@ def should_restrict_by_language?(team_ids)
!!tbi&.alegre_settings&.dig('single_language_fact_checks_enabled')
end

def search_by_keywords_for_similar_published_fact_checks(words, after, team_ids, feed_id = nil, language = nil)
def search_by_keywords_for_similar_published_fact_checks(words, after, team_ids, limit, feed_id = nil, language = nil)
types = CheckSearch::MEDIA_TYPES.clone.push('blank')
search_fields = %w(title description fact_check_title fact_check_summary extracted_text url claim_description_content)
filters = { keyword: words.join('+'), keyword_fields: { fields: search_fields }, sort: 'recent_activity', eslimit: 3, show: types }
filters.merge!({ fc_language: [language] }) if should_restrict_by_language?(team_ids)
filters = { keyword: words.join('+'), keyword_fields: { fields: search_fields }, sort: 'recent_activity', eslimit: limit, show: types }
filters.merge!({ fc_language: [language] }) if !language.blank? && should_restrict_by_language?(team_ids)
filters.merge!({ sort: 'score' }) if words.size > 1 # We still want to be able to return the latest fact-checks if a meaningful query is not passed
feed_id.blank? ? filters.merge!({ report_status: ['published'] }) : filters.merge!({ feed_id: feed_id })
filters.merge!({ range: { updated_at: { start_time: after.strftime('%Y-%m-%dT%H:%M:%S.%LZ') } } }) unless after.blank?
@@ -304,19 +305,19 @@ def ask_for_feedback_when_all_search_results_are_received(app_id, language, work
end
end

def search_for_explainers(uid, query, team_id, language)
def search_for_explainers(uid, query, team_id, limit, language = nil)
results = nil
begin
text = ::Bot::Smooch.extract_claim(query)
if Bot::Alegre.get_number_of_words(text) == 1
results = Explainer.where(team_id: team_id).where('description ILIKE ? OR title ILIKE ?', "%#{text}%", "%#{text}%")
results = results.where(language: language) if should_restrict_by_language?([team_id])
results = results.where(language: language) if !language.nil? && should_restrict_by_language?([team_id])
results = results.order('updated_at DESC')
else
results = Explainer.search_by_similarity(text, language, team_id)
results = Explainer.search_by_similarity(text, language, team_id, limit)
end
rescue StandardError => e
self.handle_search_error(uid, e, language)
self.handle_search_error(uid, e, language) unless uid.blank?
end
results.joins(:project_medias)
end
41 changes: 24 additions & 17 deletions app/models/explainer.rb
@@ -1,12 +1,6 @@
class Explainer < ApplicationRecord
include Article

# FIXME: Read from workspace settings
ALEGRE_MODELS_AND_THRESHOLDS = {
# Bot::Alegre::ELASTICSEARCH_MODEL => 0.8 # Sometimes this is easier for local development
Bot::Alegre::PARAPHRASE_MULTILINGUAL_MODEL => 0.7
}

belongs_to :team

has_annotations
@@ -71,13 +65,14 @@ def self.update_paragraphs_in_alegre(id, previous_paragraphs_count, timestamp)
explainer_id: explainer.id
}

models_thresholds = Explainer.get_alegre_models_and_thresholds(explainer.team_id).keys
# Index title
params = {
content_hash: Bot::Alegre.content_hash_for_value(explainer.title),
doc_id: Digest::MD5.hexdigest(['explainer', explainer.id, 'title'].join(':')),
context: base_context.merge({ field: 'title' }),
text: explainer.title,
models: ALEGRE_MODELS_AND_THRESHOLDS.keys,
models: models_thresholds,
}
Bot::Alegre.index_async_with_params(params, "text")

@@ -90,7 +85,7 @@ def self.update_paragraphs_in_alegre(id, previous_paragraphs_count, timestamp)
doc_id: Digest::MD5.hexdigest(['explainer', explainer.id, 'paragraph', count].join(':')),
context: base_context.merge({ paragraph: count }),
text: paragraph.strip,
models: ALEGRE_MODELS_AND_THRESHOLDS.keys,
models: models_thresholds,
}
Bot::Alegre.index_async_with_params(params, "text")
end
@@ -107,23 +102,35 @@ def self.update_paragraphs_in_alegre(id, previous_paragraphs_count, timestamp)
end
end

def self.search_by_similarity(text, language, team_id)
def self.search_by_similarity(text, language, team_id, limit)
models_thresholds = Explainer.get_alegre_models_and_thresholds(team_id)
context = {
type: 'explainer',
team: Team.find(team_id).slug
}
context[:language] = language unless language.nil?
params = {
text: text,
models: ALEGRE_MODELS_AND_THRESHOLDS.keys,
per_model_threshold: ALEGRE_MODELS_AND_THRESHOLDS,
context: {
type: 'explainer',
team: Team.find(team_id).slug,
language: language
}
models: models_thresholds.keys,
per_model_threshold: models_thresholds,
context: context

}
response = Bot::Alegre.query_sync_with_params(params, "text")
results = response['result'].to_a.sort_by{ |result| result['_score'] }
explainer_ids = results.collect{ |result| result.dig('context', 'explainer_id').to_i }.uniq.first(3)
explainer_ids = results.collect{ |result| result.dig('context', 'explainer_id').to_i }.uniq.first(limit)
explainer_ids.empty? ? Explainer.none : Explainer.where(team_id: team_id, id: explainer_ids)
end

def self.get_alegre_models_and_thresholds(team_id)
models_thresholds = {}
Bot::Alegre.get_similarity_methods_and_models_given_media_type_and_team_id("text", team_id, true).map do |similarity_method, model_name|
_, value = Bot::Alegre.get_threshold_given_model_settings(team_id, "text", similarity_method, true, model_name)
models_thresholds[model_name] = value
end
models_thresholds
end
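
To see how the hard-coded `ALEGRE_MODELS_AND_THRESHOLDS` constant is replaced by a per-team Hash, here is a minimal sketch of the pairing and collection steps. It assumes the list of vector models is passed in directly (the real code derives it from `matching_model_to_use(team_id)`) and that the threshold lookup is supplied as a block; both helper names are hypothetical.

```ruby
# Hypothetical stand-alone sketch; the vector model list and the threshold
# lookup are passed in instead of being read from workspace settings.
def similarity_methods_and_models(media_type, vector_models, get_vector_settings)
  similarity_methods = media_type == 'text' ? ['elasticsearch'] : ['hash']
  models = similarity_methods.dup
  if media_type == 'text' && get_vector_settings
    vector_models.each do |model|
      similarity_methods << 'vector'
      models << model
    end
  end
  similarity_methods.zip(models)
end

# Collects { model_name => threshold }, asking the block for each threshold.
def models_and_thresholds(pairs)
  pairs.each_with_object({}) do |(similarity_method, model_name), acc|
    acc[model_name] = yield(similarity_method, model_name)
  end
end
```

The resulting Hash has the same shape as the removed constant, so its keys can be sent as `models` and the Hash itself as `per_model_threshold`.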

private

def set_team
16 changes: 16 additions & 0 deletions app/models/project_media.rb
@@ -455,6 +455,22 @@ def replace_with_blank_media
self.save!
end

def get_similar_articles
# Get the search query based on the media type:
# Quote for Claim
# Transcription for UploadedVideo, UploadedAudio and UploadedImage
# Title and/or description for Link
media = self.media
search_query = case media.type
when 'Claim'
media.quote
when 'UploadedVideo', 'UploadedAudio', 'UploadedImage'
self.transcription
end
search_query ||= self.title
self.team.search_for_similar_articles(search_query, self)
end
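
The query selection in `get_similar_articles` can be tried out without Rails. The sketch below uses plain Structs as stand-ins for `Media` and `ProjectMedia`; `MediaStub`, `ItemStub` and `similar_articles_query` are hypothetical names. As far as this diff shows, any type not listed in the `case` (for example a Link) and any item with a `nil` transcription fall back to the title.

```ruby
# Hypothetical stand-ins for Media and ProjectMedia; not the real models.
MediaStub = Struct.new(:type, :quote)

ItemStub = Struct.new(:media, :transcription, :title) do
  # Same branching as get_similar_articles: quote for claims,
  # transcription for uploaded media, title as the fallback.
  def similar_articles_query
    query = case media.type
            when 'Claim'
              media.quote
            when 'UploadedVideo', 'UploadedAudio', 'UploadedImage'
              transcription
            end
    query || title
  end
end
```

Note that the fallback only triggers on `nil`, so an empty transcription string would still be used as the query.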

protected

def add_extra_elasticsearch_data(ms)
24 changes: 24 additions & 0 deletions app/models/team.rb
@@ -563,6 +563,30 @@ def filter_by_keywords(query, filters, type = 'FactCheck')
query.where(Arel.sql("#{tsvector} @@ #{tsquery}"))
end

def search_for_similar_articles(query, pm = nil)
# query: expected to be text
# pm: when given, return the articles most relevant to that specific item and include both FactCheck & Explainer
limit = pm.nil? ? CheckConfig.get('most_relevant_team_limit', 3, :integer) : CheckConfig.get('most_relevant_item_limit', 10, :integer)
result_ids = Bot::Smooch.search_for_similar_published_fact_checks_no_cache('text', query, [self.id], limit).map(&:id)
items = []
unless result_ids.blank?
# Filter the results through FactCheck instead of report_design
items = FactCheck.where(report_status: 'published')
.joins(claim_description: :project_media)
.where('project_medias.id': result_ids)
# Exclude the ones already applied to a target item, if it exists
items = items.where.not('fact_checks.id' => pm.fact_check_id) unless pm&.fact_check_id.nil?
end
if items.blank? || !pm.nil?
# Get Explainers if no fact-check returned or get similar_articles for a ProjectMedia
ex_items = Bot::Smooch.search_for_explainers(nil, query, self.id, limit)
# Exclude the ones already applied to a target item
ex_items = ex_items.where.not(id: pm.explainer_ids) unless pm&.explainer_ids.blank?
items = items + ex_items
end
items
end
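
The branching in `search_for_similar_articles` comes down to two decisions: which limit applies, and whether explainers are appended. A minimal sketch with plain arrays, assuming the default values passed to `CheckConfig.get` in the diff (3 per team, 10 per item); `pick_limit` and `combine_articles` are hypothetical helpers.

```ruby
# Hypothetical stand-alone sketch using arrays instead of ActiveRecord relations.
# Defaults mirror the CheckConfig.get fallbacks shown in the diff.
def pick_limit(item_given, team_limit: 3, item_limit: 10)
  item_given ? item_limit : team_limit
end

# Explainers are fetched (via the block) only when no fact-check was found
# or when the search targets a specific item.
def combine_articles(fact_checks, item_given)
  return fact_checks unless fact_checks.empty? || item_given
  fact_checks + yield
end
```

Taking the explainer search as a block keeps it lazy, which matches the original method where `search_for_explainers` runs only inside the `if`.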

# private
#
# Please add private methods to app/models/concerns/team_private.rb