
Repository for collecting and categorizing papers outlined in our survey paper: "Large Language Models on Tabular Data -- A Survey".


bird-bench/LLM-on-Tabular-Data-Prediction-Table-Understanding-Data-Generation

 
 


Large Language Models on Tabular Data -- A Survey

@misc{fang2024large,
      title={Large Language Models on Tabular Data -- A Survey}, 
      author={Xi Fang and Weijie Xu and Fiona Anting Tan and Jiani Zhang and Ziqing Hu and Yanjun Qi and Scott Nickleach and Diego Socolinsky and Srinivasan Sengamedu and Christos Faloutsos},
      year={2024},
      eprint={2402.17944},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Original paper

LLM on Tabular Data Prediction and Understanding -- A Survey

This repo collects and categorizes papers about large language models on tabular data, following our survey paper "Large Language Models on Tabular Data -- A Survey". Because the field is developing quickly, we will continue to update both the arXiv paper and this repo.

Abstract
Recent breakthroughs in large language modeling have facilitated rigorous exploration of their application in diverse tasks related to tabular data modeling, such as prediction, tabular data synthesis, question answering, and table understanding. Each task presents unique challenges and opportunities. However, there is currently a lack of comprehensive review that summarizes and compares the key techniques, metrics, datasets, models, and optimization approaches in this research domain. This survey aims to address this gap by consolidating recent progress in these areas, offering a thorough survey and taxonomy of the datasets, metrics, and methodologies utilized. It identifies strengths, limitations, unexplored territories, and gaps in the existing literature, while providing some insights for future research directions in this vital and rapidly evolving field. It also provides relevant code and datasets references. Through this comprehensive review, we hope to provide interested readers with pertinent references and insightful perspectives, empowering them with the necessary tools and knowledge to effectively navigate and address the prevailing challenges in the field.

Figure 1: Overview of LLMs on tabular data: the paper discusses the application of LLMs to prediction, data generation, and table understanding tasks.

Figure 4: Key techniques in using LLMs for tabular data. The dotted line indicates steps that are optional.

Table of contents:

Taxonomy

Prediction task


Tabular Data

TABLET: Learning From Instructions For Tabular Data [code]

Language models are weak learners

LIFT: Language-Interfaced Fine-Tuning for Non-Language Machine Learning Tasks [code]

TabLLM: Few-shot Classification of Tabular Data with Large Language Models [code]

UniPredict: Large Language Models are Universal Tabular Classifiers

Towards Foundation Models for Learning on Tabular Data

Towards Better Serialization of Tabular Data for Few-shot Classification with Large Language Models
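
Several of the titles above concern serializing table rows into text so that an LLM can consume them. As a rough illustration only, here is a minimal sketch of a template-style serialization. The function names, the sentence template, the sample row, and the question are all hypothetical and are not taken from any of the listed papers or their code.

```python
from typing import Optional


def serialize_row(row: dict, target: Optional[str] = None) -> str:
    """Turn one table row into sentences, skipping the target column."""
    # Hypothetical template: "The <column> is <value>." for every feature.
    parts = [f"The {col} is {val}." for col, val in row.items() if col != target]
    return " ".join(parts)


def build_prompt(row: dict, target: str, question: str) -> str:
    """Combine the serialized features with a task question."""
    return f"{serialize_row(row, target)}\n{question}\nAnswer:"


# Made-up example row; "income" plays the role of the label to predict.
row = {"age": 42, "education": "Bachelors", "hours_per_week": 40, "income": ">50K"}
prompt = build_prompt(
    row,
    target="income",
    question="Does this person earn more than 50K per year? Yes or no?",
)
print(prompt)
```

The label column is left out of the serialized text so that the prompt does not leak the answer. The resulting string would then be sent to whichever LLM is being evaluated.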

Time series

PromptCast: A New Prompt-based Learning Paradigm for Time Series Forecasting

Large Language Models Are Zero-Shot Time Series Forecasters

TEST: Text Prototype Aligned Embedding to Activate LLM's Ability for Time Series

Time-LLM: Time Series Forecasting by Reprogramming Large Language Models [code]

Application Specific

MediTab: Scaling Medical Tabular Data Predictors via Data Consolidation, Enrichment, and Refinement [code]

CPLLM: Clinical Prediction with Large Language Models [code]

SERVAL: Synergy Learning between Vertical Models and LLMs towards Oracle-Level Zero-shot Medical Prediction

CTRL: Connect Collaborative and Language Model for CTR Prediction

FinGPT: Open-Source Financial Large Language Models [code]

Data Generation task


Language Models are Realistic Tabular Data Generators [code]

REaLTabFormer: Generating Realistic Relational and Tabular Data using Transformers

Generative Table Pre-training Empowers Models for Tabular Prediction [code]

TabuLa: Harnessing Language Models for Tabular Data Synthesis [code]

Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in ultra low-data regimes

TabMT: Generating tabular data with masked transformers

Elephants Never Forget: Testing Language Models for Memorization of Tabular Data

Table understanding


Numeric Question Answering

DocMath-Eval: Evaluating Numerical Reasoning Capabilities of LLMs in Understanding Long Documents with Tabular Data

Exploring the Numerical Reasoning Capabilities of Language Models: A Comprehensive Analysis on Tabular Data

TableGPT: Towards Unifying Tables, Nature Language and Commands into One GPT

Question Answering

Large Language Models are Versatile Decomposers: Decompose Evidence and Questions for Table-based Reasoning [code]

PACIFIC: Towards Proactive Conversational Question Answering over Tabular and Textual Data in Finance [code]

Large Language Models are few(1)-shot Table Reasoners [code]

cTBLS: Augmenting Large Language Models with Conversational Tables [code]

Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study

Large Language Models are Complex Table Parsers

Rethinking Tabular Data Understanding with Large Language Models [code]

TableGPT: Towards Unifying Tables, Nature Language and Commands into One GPT

Testing the Limits of Unified Sequence to Sequence LLM Pretraining on Diverse Table Data Tasks

Unified Language Representation for Question Answering over Text, Tables, and Images

TableLlama: Towards Open Large Generalist Models for Tables [code]

DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text

StructGPT: A General Framework for Large Language Model to Reason over Structured Data [code]

JarviX: A LLM No code Platform for Tabular Data Analysis and Optimization

Text2SQL

Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation

DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction [code]

C3: Zero-shot Text-to-SQL with ChatGPT [code]

Bridging the Gap: Deciphering Tabular Data Using Large Language Model

TableQuery: Querying tabular data with natural language [code]

Datasets

Please refer to our paper to see relevant methods that benchmark on these datasets.

Prediction Tasks

| Dataset | Number of Datasets | Dataset Repo |
| --- | --- | --- |
| OpenML | 11 | https://github.com/UW-Madison-Lee-Lab/LanguageInterfacedFineTuning/tree/master/regression/realdata/data |
| Kaggle API | 169 | https://github.com/Kaggle/kaggle-api |
| Combo | 9 | https://github.com/clinicalml/TabLLM/tree/main/datasets |
| UCI ML | 20 | https://github.com/dylan-slack/Tablet/tree/main/data/benchmark/performance |
| DDX | 10 | https://github.com/dylan-slack/Tablet/tree/main/data/ddx_data_no_instructions/benchmark |

Table Understanding Tasks

| Dataset | # Tables | Task Type | Input | Output | Data Source | Dataset Repo |
| --- | --- | --- | --- | --- | --- | --- |
| FetaQA | 10330 | QA | Table, Question | Answer | Wikipedia | https://github.com/Yale-LILY/FeTaQA |
| WikiTableQuestion | 2108 | QA | Table, Question | Answer | Wikipedia | https://ppasupat.github.io/WikiTableQuestions/ |
| NQ-TABLES | 169898 | QA | Question, Table | Answer | Synthetic | https://github.com/google-research-datasets/natural-questions |
| HybriDialogue | 13000 | QA | Conversation, Table, Reference | Answer | Wikipedia | https://github.com/entitize/HybridDialogue |
| TAT-QA | 2757 | QA | Question, Table | Answer | Financial report | https://github.com/NExTplusplus/TAT-QA |
| HiTAB | 3597 | QA/NLG | Question, Table | Answer | Statistical Report and Wikipedia | https://github.com/microsoft/HiTab |
| ToTTo | 120000 | NLG | Table | Sentence | Wikipedia | https://github.com/google-research-datasets/ToTTo |
| FEVEROUS | 28800 | Classification | Claim, Table | Label | Common Crawl | https://fever.ai/dataset/feverous.html |
| Dresden Web Tables | 125M | Classification | Table | Label | Common Crawl | https://ppasupat.github.io/WikiTableQuestions/ |
| InfoTabs | 2540 | NLI | Table, Hypothesis | Label | Wikipedia | https://infotabs.github.io/ |
| TabFact | 16573 | NLI | Table, Statement | Label | Wikipedia | https://tabfact.github.io/ |
| TAPEX | 1500 | Text2SQL | SQL, Table | Answer | Synthetic | https://github.com/google-research/tapas |
| Spider | 1020 | Text2SQL | Table, Question | SQL | Human annotation | https://drive.usercontent.google.com/download?id=1iRDVHLr4mX2wQKSgA9J8Pire73Jahh0m&export=download&authuser=0 |
| WIKISQL | 24241 | Text2SQL | Table, Question | SQL, Answer | Human Annotated | https://github.com/salesforce/WikiSQL |

Contributing

If you would like to contribute to this list or writeup, feel free to submit a pull request!
