diff --git a/arxiv.json b/arxiv.json
index 3c9a217d..a4d3485e 100644
--- a/arxiv.json
+++ b/arxiv.json
@@ -33794,5 +33794,40 @@
 "pub_date": "2024-10-28",
 "summary": "As language models (LMs) become integral to fields like healthcare, law, and\njournalism, their ability to differentiate between fact, belief, and knowledge\nis essential for reliable decision-making. Failure to grasp these distinctions\ncan lead to significant consequences in areas such as medical diagnosis, legal\njudgments, and dissemination of fake news. Despite this, current literature has\nlargely focused on more complex issues such as theory of mind, overlooking more\nfundamental epistemic challenges. This study systematically evaluates the\nepistemic reasoning capabilities of modern LMs, including GPT-4, Claude-3, and\nLlama-3, using a new dataset, KaBLE, consisting of 13,000 questions across 13\ntasks. Our results reveal key limitations. First, while LMs achieve 86%\naccuracy on factual scenarios, their performance drops significantly with false\nscenarios, particularly in belief-related tasks. Second, LMs struggle with\nrecognizing and affirming personal beliefs, especially when those beliefs\ncontradict factual data, which raises concerns for applications in healthcare\nand counseling, where engaging with a person's beliefs is critical. Third, we\nidentify a salient bias in how LMs process first-person versus third-person\nbeliefs, performing better on third-person tasks (80.7%) compared to\nfirst-person tasks (54.4%). Fourth, LMs lack a robust understanding of the\nfactive nature of knowledge, namely, that knowledge inherently requires truth.\nFifth, LMs rely on linguistic cues for fact-checking and sometimes bypass the\ndeeper reasoning. These findings highlight significant concerns about current\nLMs' ability to reason about truth, belief, and knowledge while emphasizing the\nneed for advancements in these areas before broad deployment in critical\nsectors.",
 "translated": "随着语言模型(LM)成为医疗保健、法律和新闻等领域不可或缺的一部分,它们区分事实、信仰和知识的能力对于可靠的决策至关重要。如果不能把握这些区别,可能会在医学诊断、法律判决和传播假新闻等领域造成严重后果。尽管如此,目前的文献主要集中在更复杂的问题,如心理理论,忽视了更基本的认识挑战。这项研究系统地评估了现代 LM 的认知推理能力,包括 GPT-4,Claude-3和 Llama-3,使用一个新的数据集,KABLE,包括13个任务的13,000个问题。我们的研究结果揭示了关键的局限性。首先,当 LM 在实际情景中达到86% 的准确率时,他们在错误情景中的表现会显著下降,特别是在与信念相关的任务中。其次,LM 很难承认和肯定个人信念,尤其是当这些信念与事实数据相矛盾时,这就引起了医疗保健和咨询应用的关注,而参与到个人信念中是至关重要的。第三,我们发现 LM 如何处理第一人称与第三人称信念的显着偏差,第三人称任务(80.7%)比第一人称任务(54.4%)表现更好。第四,LM 缺乏对知识的事实性本质的有力理解,即知识本质上需要真理。第五,语言学习模式依赖语言线索进行事实核查,有时会绕过更深层次的推理。这些发现强调了当前 LM 对真理、信仰和知识进行推理的能力,同时强调了在关键部门广泛部署之前在这些领域取得进展的必要性。"
+ },
+ {
+ "title": "Pushing the Performance Envelope of DNN-based Recommendation Systems\n Inference on GPUs",
+ "url": "http://arxiv.org/abs/2410.22249v1",
+ "pub_date": "2024-10-29",
+ "summary": "Personalized recommendation is a ubiquitous application on the internet, with\nmany industries and hyperscalers extensively leveraging Deep Learning\nRecommendation Models (DLRMs) for their personalization needs (like ad serving\nor movie suggestions). With growing model and dataset sizes pushing computation\nand memory requirements, GPUs are being increasingly preferred for executing\nDLRM inference. However, serving newer DLRMs, while meeting acceptable\nlatencies, continues to remain challenging, making traditional deployments\nincreasingly more GPU-hungry, resulting in higher inference serving costs. In\nthis paper, we show that the embedding stage continues to be the primary\nbottleneck in the GPU inference pipeline, leading up to a 3.2x embedding-only\nperformance slowdown.\n To thoroughly grasp the problem, we conduct a detailed microarchitecture\ncharacterization and highlight the presence of low occupancy in the standard\nembedding kernels. By leveraging direct compiler optimizations, we achieve\noptimal occupancy, pushing the performance by up to 53%. Yet, long memory\nlatency stalls continue to exist. To tackle this challenge, we propose\nspecialized plug-and-play-based software prefetching and L2 pinning techniques,\nwhich help in hiding and decreasing the latencies. Further, we propose\ncombining them, as they complement each other. Experimental evaluations using\nA100 GPUs with large models and datasets show that our proposed techniques\nimprove performance by up to 103% for the embedding stage, and up to 77% for\nthe overall DLRM inference pipeline.",
+ "translated": "个性化推荐是互联网上一个无处不在的应用程序,许多行业和超级用户广泛利用深度学习推荐模型(DLRM)来满足他们的个性化需求(如广告服务或电影推荐)。随着模型和数据集规模的不断增长,对计算和内存的要求越来越高,GPU 越来越多地被用于执行 DLRM 推理。然而,在满足可接受的延迟的同时,为新的 DLRM 提供服务仍然具有挑战性,这使得传统部署越来越依赖于 GPU,从而导致更高的推断服务成本。在本文中,我们展示了嵌入阶段仍然是 GPU 推理流水线的主要瓶颈,导致仅嵌入性能下降3.2倍。为了彻底掌握这个问题,我们进行了详细的微架构角色塑造,并强调了标准嵌入内核中存在的低占用率。通过利用直接的编译器优化,我们实现了最佳占用率,将性能提高了53% 。然而,长时间的内存延迟仍然存在。为了应对这一挑战,我们提出了专门的基于即插即用的软件预取和 L2固定技术,这有助于隐藏和减少延迟。此外,我们建议将它们结合起来,因为它们是相辅相成的。使用 A100图形处理器和大型模型和数据集的实验表明,我们提出的技术提高性能高达103% 的嵌入阶段,高达77% 的整体 DLRM 推理流水线。"
+ },
+ {
+ "title": "ContextIQ: A Multimodal Expert-Based Video Retrieval System for\n Contextual Advertising",
+ "url": "http://arxiv.org/abs/2410.22233v1",
+ "pub_date": "2024-10-29",
+ "summary": "Contextual advertising serves ads that are aligned to the content that the\nuser is viewing. The rapid growth of video content on social platforms and\nstreaming services, along with privacy concerns, has increased the need for\ncontextual advertising. Placing the right ad in the right context creates a\nseamless and pleasant ad viewing experience, resulting in higher audience\nengagement and, ultimately, better ad monetization. From a technology\nstandpoint, effective contextual advertising requires a video retrieval system\ncapable of understanding complex video content at a very granular level.\nCurrent text-to-video retrieval models based on joint multimodal training\ndemand large datasets and computational resources, limiting their practicality\nand lacking the key functionalities required for ad ecosystem integration. We\nintroduce ContextIQ, a multimodal expert-based video retrieval system designed\nspecifically for contextual advertising. ContextIQ utilizes modality-specific\nexperts-video, audio, transcript (captions), and metadata such as objects,\nactions, emotion, etc.-to create semantically rich video representations. We\nshow that our system, without joint training, achieves better or comparable\nresults to state-of-the-art models and commercial solutions on multiple\ntext-to-video retrieval benchmarks. Our ablation studies highlight the benefits\nof leveraging multiple modalities for enhanced video retrieval accuracy instead\nof using a vision-language model alone. Furthermore, we show how video\nretrieval systems such as ContextIQ can be used for contextual advertising in\nan ad ecosystem while also addressing concerns related to brand safety and\nfiltering inappropriate content.",
+ "translated": "上下文广告服务的广告与用户正在浏览的内容保持一致。社交平台和流媒体服务上视频内容的快速增长,加上对隐私的担忧,增加了对上下文广告的需求。把正确的广告放在正确的上下文中,可以创造一种无缝的、愉快的广告观看体验,从而提高受众参与度,最终提高广告收入。从技术的角度来看,有效的上下文广告需要一个视频检索系统,能够理解非常细粒度的复杂视频内容。目前基于联合多模式训练的文本-视频检索模型需要大量的数据和计算资源,限制了其实用性,缺乏广告生态系统整合所需的关键功能。我们介绍了 ContextIQ,一个专门为上下文广告设计的基于多模态专家的视频检索系统。ContextIQ 利用特定于情态的专家——视频、音频、文本(标题)和元数据(如对象、动作、情感等)——创建语义丰富的视频表示。我们表明,我们的系统,没有联合培训,实现了更好或可比的结果,国家的最先进的模型和商业解决方案的多个文本到视频检索基准。我们的消融研究强调了利用多种方式提高视频检索准确性的好处,而不是仅仅使用视觉语言模型。此外,我们展示了如何视频检索系统,如 ContextIQ 可以用于广告生态系统中的上下文广告,同时也解决相关的品牌安全和过滤不适当的内容。"
+ },
+ {
+ "title": "Synthetic Data Generation with Large Language Models for Personalized\n Community Question Answering",
+ "url": "http://arxiv.org/abs/2410.22182v1",
+ "pub_date": "2024-10-29",
+ "summary": "Personalization in Information Retrieval (IR) is a topic studied by the\nresearch community since a long time. However, there is still a lack of\ndatasets to conduct large-scale evaluations of personalized IR; this is mainly\ndue to the fact that collecting and curating high-quality user-related\ninformation requires significant costs and time investment. Furthermore, the\ncreation of datasets for Personalized IR (PIR) tasks is affected by both\nprivacy concerns and the need for accurate user-related data, which are often\nnot publicly available. Recently, researchers have started to explore the use\nof Large Language Models (LLMs) to generate synthetic datasets, which is a\npossible solution to generate data for low-resource tasks. In this paper, we\ninvestigate the potential of Large Language Models (LLMs) for generating\nsynthetic documents to train an IR system for a Personalized Community Question\nAnswering task. To study the effectiveness of IR models fine-tuned on\nLLM-generated data, we introduce a new dataset, named Sy-SE-PQA. We build\nSy-SE-PQA based on an existing dataset, SE-PQA, which consists of questions and\nanswers posted on the popular StackExchange communities. Starting from\nquestions in SE-PQA, we generate synthetic answers using different prompt\ntechniques and LLMs. Our findings suggest that LLMs have high potential in\ngenerating data tailored to users' needs. The synthetic data can replace\nhuman-written training data, even if the generated data may contain incorrect\ninformation.",
+ "translated": "长期以来,信息检索中的个性化是研究界研究的一个课题。然而,仍然缺乏对个性化信息检索进行大规模评价的数据集; 这主要是由于收集和管理与用户有关的高质量信息需要大量费用和时间投入。此外,个性化 IR (Personalization IR,PIR)任务的数据集的创建受到隐私问题和对精确的用户相关数据的需求的影响,这些数据通常是不公开的。最近,研究人员已经开始探索使用大语言模型(LLM)来生成合成数据集,这是一个可能的解决方案,以生成低资源任务的数据。在本文中,我们研究了大语言模型(LLM)在生成合成文档以训练个性化社区问答任务的 IR 系统方面的潜力。为了研究对 LLM 生成的数据进行微调的红外模型的有效性,我们引入了一个新的数据集,称为 Sy-SE-PQA。我们基于现有的数据集 SE-PQA 构建 Sy-SE-PQA,该数据集由发布在流行的 StackExchange 社区上的问题和答案组成。从 SE-PQA 的问题开始,我们使用不同的提示技术和 LLM 生成综合答案。我们的研究结果表明,LLM 在生成适合用户需求的数据方面具有很大的潜力。合成数据可以替代人写的训练数据,即使生成的数据可能包含不正确的信息。"
+ },
+ {
+ "title": "SimRec: Mitigating the Cold-Start Problem in Sequential Recommendation\n by Integrating Item Similarity",
+ "url": "http://arxiv.org/abs/2410.22136v1",
+ "pub_date": "2024-10-29",
+ "summary": "Sequential recommendation systems often struggle to make predictions or take\naction when dealing with cold-start items that have limited amount of\ninteractions. In this work, we propose SimRec - a new approach to mitigate the\ncold-start problem in sequential recommendation systems. SimRec addresses this\nchallenge by leveraging the inherent similarity among items, incorporating item\nsimilarities into the training process through a customized loss function.\nImportantly, this enhancement is attained with identical model architecture and\nthe same amount of trainable parameters, resulting in the same inference time\nand requiring minimal additional effort. This novel approach results in a\nrobust contextual sequential recommendation model capable of effectively\nhandling rare items, including those that were not explicitly seen during\ntraining, thereby enhancing overall recommendation performance. Rigorous\nevaluations against multiple baselines on diverse datasets showcase SimRec's\nsuperiority, particularly in scenarios involving items occurring less than 10\ntimes in the training data. The experiments reveal an impressive improvement,\nwith SimRec achieving up to 78% higher HR@10 compared to SASRec. Notably,\nSimRec outperforms strong baselines on sparse datasets while delivering on-par\nperformance on dense datasets. Our code is available at\nhttps://github.com/amazon-science/sequential-recommendation-using-similarity.",
+ "translated": "在处理交互量有限的冷启动项目时,顺序推荐系统往往难以做出预测或采取行动。在这项工作中,我们提出了一种新的方法,以减轻冷启动问题的顺序推荐系统。SimRec 通过利用项目之间固有的相似性,通过定制的损失函数将项目相似性纳入培训过程,从而解决了这一挑战。重要的是,这种增强是通过相同的模型体系结构和相同数量的可训练参数来实现的,从而产生相同的推理时间,并且只需要最少的额外工作。这种新方法产生了一个健壮的上下文顺序推荐模型,能够有效地处理罕见的项目,包括那些在培训期间没有明确看到的项目,从而提高了整体推荐性能。针对不同数据集上的多个基线的严格评估展示了 SimRec 的优越性,特别是在训练数据中出现少于10次的项目的场景中。实验显示了令人印象深刻的改善,与 SASRec 相比,SimRec 的 HR@10高出78% 。值得注意的是,SimRec 在稀疏数据集上的性能优于强基线,而在密集数据集上的性能相当。我们的代码可以在 https://github.com/amazon-science/sequential-recommendation-using-similarity 找到。"
+ },
+ {
+ "title": "Testing Identity of Distributions under Kolmogorov Distance in\n Polylogarithmic Space",
+ "url": "http://arxiv.org/abs/2410.22123v1",
+ "pub_date": "2024-10-29",
+ "summary": "Suppose we have a sample from a distribution $D$ and we want to test whether\n$D = D^*$ for a fixed distribution $D^*$. Specifically, we want to reject with\nconstant probability, if the distance of $D$ from $D^*$ is $\\geq \\varepsilon$\nin a given metric. In the case of continuous distributions, this has been\nstudied thoroughly in the statistics literature. Namely, for the well-studied\nKolmogorov metric a test is known that uses the optimal $O(1/\\varepsilon^2)$\nsamples.\n However, this test naively uses also space $O(1/\\varepsilon^2)$, and previous\nwork improved this to $O(1/\\varepsilon)$. In this paper, we show that much less\nspace suffices -- we give an algorithm that uses space $O(\\log^4\n\\varepsilon^{-1})$ in the streaming setting while also using an asymptotically\noptimal number of samples. This is in contrast with the standard total\nvariation distance on discrete distributions for which such space reduction is\nknown to be impossible. Finally, we state 9 related open problems that we hope\nwill spark interest in this and related problems.",
+ "translated": "假设我们有一个来自发行版 $D $的示例,我们想要测试固定发行版 $D ^ * $是否为 $D = D ^ * $。具体来说,如果 $D $从 $D ^ * $的距离是给定度量中的 $geq varepsilon $,我们希望以常数概率拒绝。在连续分布的情况下,这已经在统计文献中进行了深入的研究。也就是说,对于研究得很好的 Kolmogorov 度量,已知使用最优 $O (1/varepsilon ^ 2) $样本的测试。然而,这个测试也天真地使用空格 $O (1/varepsilon ^ 2) $,并且以前的工作将其改进为 $O (1/varepsilon) $。在本文中,我们证明了更少的空间就足够了——我们给出了一个算法,在流设置中使用空间 $O (log ^ 4 varepsilon ^ {-1}) $,同时使用渐近最优的样本数。这与离散分布上的标准总变差距形成了对比,对于这种空间缩减已知是不可能的。最后,我们陈述了9个相关的公开问题,我们希望这些问题能引起人们对这个问题和相关问题的兴趣。"
 }
 ]
\ No newline at end of file