
Issue with dataset concatenation #43

Open
jplu opened this issue Dec 28, 2022 · 0 comments

jplu commented Dec 28, 2022

Hi,

First of all, there is a bug in https://github.com/asahi417/tner/blob/master/tner/tner_cl/train.py#L118: the GridSearcher call should be:

trainer = GridSearcher(
        checkpoint_dir=opt.checkpoint_dir,
        dataset=opt.dataset,
        local_dataset=opt.local_dataset,
        dataset_name=opt.dataset_name,
        n_max_config=opt.n_max_config,
        epoch_partial=opt.epoch_partial,
        max_length_eval=opt.max_length_eval,
        dataset_split_train=opt.dataset_split_train,
        dataset_split_valid=opt.dataset_split_valid,
        model=opt.model,
        crf=opt.crf,
        max_length=opt.max_length,
        epoch=opt.epoch,
        batch_size=opt.batch_size,
        lr=opt.lr,
        random_seed=opt.random_seed,
        gradient_accumulation_steps=opt.gradient_accumulation_steps,
        weight_decay=[i if i != 0 else None for i in opt.weight_decay],
        lr_warmup_step_ratio=[i if i != 0 else None for i in opt.lr_warmup_step_ratio],
        max_grad_norm=[i if i != 0 else None for i in opt.max_grad_norm],
        use_auth_token=opt.use_auth_token
    )

The dataset_name argument was missing.

Then, when I want to train a model over two different datasets, they are not properly concatenated. Here is a simple example to reproduce:

tner-train-search -m "xlm-roberta-base" -c "output/" -d "tner/wikiann" "tner/tweetner7" --dataset-name "ace" "tweetner7" -e 15 --epoch-partial 5 --n-max-config 3 -b 32 -g 2 4 --lr 1e-6 1e-5 --crf 0 1 --max-grad-norm 0 10 --weight-decay 0 1e-7

According to the logs we get:

encode all the data: 7111

7111 is the size of the tner/tweetner7 dataset for the split train_all. The real size should be 100 + 7111, the former being the size of the train split of the ace subdataset of tner/wikiann.

I don't know if this is an easy fix or not. I will be happy to help if needed.
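For reference, the behaviour I would expect can be sketched with plain lists (this is an illustration only, not tner's actual internals; the helper name and the example data are hypothetical, and I use a single "train" split name for both datasets for simplicity):

```python
# Minimal sketch of the expected concatenation behaviour:
# matching splits from several datasets should be merged, so the
# encoded training set has the combined size, not just one dataset's.

def concat_splits(*datasets):
    """Merge datasets (dicts of split -> list of examples) by
    concatenating the example lists of matching splits."""
    merged = {}
    for dataset in datasets:
        for split, examples in dataset.items():
            merged.setdefault(split, []).extend(examples)
    return merged

# Stand-ins for the two datasets in the command above
# (100 examples for wikiann/ace, 7111 for tweetner7).
wikiann_ace = {"train": [f"wikiann-ex-{i}" for i in range(100)]}
tweetner7 = {"train": [f"tweetner7-ex-{i}" for i in range(7111)]}

merged = concat_splits(wikiann_ace, tweetner7)
print(len(merged["train"]))  # expected: 7211, not 7111
```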
