Train and Validation losses behavior #1822

lcoandrade · 2024-01-23T13:49:42Z

I'm working with my students on a final graduation project that aims to extract building footprints using the Inria dataset. We are not using the version available through TorchGeo, but the one available on Kaggle.

We are using a CSV logger. Our hyperparameters for a 10-epoch run are below (we also went up to 30 epochs):

EPOCHS = 10
LR = 1e-5

IN_CHANNELS = 3
IMG_SIZE = 512
MAX_WINDOWS = 100
BATCH_SIZE = 8

PATIENCE = 5
SEGMENTATION_MODEL = 'deeplabv3+' 
BACKBONE = 'resnet50'
LOSS = 'focal'

We are using a CustomGeoDataModule:

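import torch
from torchgeo.datamodules import GeoDataModule
from torchgeo.datasets import random_bbox_assignment, stack_samples
from torchgeo.samplers import GridGeoSampler, RandomBatchGeoSampler

class CustomGeoDataModule(GeoDataModule):
    def setup(self, stage: str) -> None:
        """Set up datasets.

        Args:
            stage: Either 'fit', 'validate', 'test', or 'predict'.
        """
        self.dataset = self.dataset_class(**self.kwargs)

        # Fixed seed so the train/val/test split is reproducible.
        # Split self.dataset (built above) rather than a module-level `dataset`.
        generator = torch.Generator().manual_seed(0)
        (
            self.train_dataset,
            self.val_dataset,
            self.test_dataset,
        ) = random_bbox_assignment(self.dataset, [0.6, 0.2, 0.2], generator)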
        
        if stage in ["fit"]:
            self.train_batch_sampler = RandomBatchGeoSampler(
                self.train_dataset, 
                self.patch_size, 
                self.batch_size, 
                self.length,
            )
                       
        if stage in ["fit", "validate"]:
            self.val_sampler = GridGeoSampler(
                self.val_dataset, 
                self.patch_size, 
                self.patch_size,
            )
        if stage in ["test"]:
            self.test_sampler = GridGeoSampler(
                self.test_dataset, 
                self.patch_size, 
                self.patch_size,
            )
            
datamodule = CustomGeoDataModule(
    dataset_class = type(dataset), # GeoDataModule kwargs
    batch_size = BATCH_SIZE, # GeoDataModule kwargs
    patch_size = IMG_SIZE, # GeoDataModule kwargs
    length = MAX_WINDOWS*180, # GeoDataModule kwargs. 180 because we have 180 images and we want an average of 100 samples for each image
    num_workers = WORKERS, # GeoDataModule kwargs
    dataset1 = image_set, # IntersectionDataset kwargs
    dataset2 = gt_set, # IntersectionDataset kwargs
    collate_fn = stack_samples, # IntersectionDataset kwargs
)
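
As a rough check of what `length = MAX_WINDOWS*180` implies per epoch, the arithmetic can be sketched as below. It assumes that `length` counts samples (not batches) per epoch, so the number of training batches is `length // BATCH_SIZE`, and that the 0.6 split leaves about 108 of the 180 images for training. Both are assumptions about the setup, not values confirmed in the post.

```python
# Hyperparameters copied from the post above.
MAX_WINDOWS = 100
BATCH_SIZE = 8
NUM_IMAGES = 180
TRAIN_FRACTION = 0.6  # first entry of [0.6, 0.2, 0.2]

# Assumption: `length` is the number of samples drawn per epoch.
length = MAX_WINDOWS * NUM_IMAGES
batches_per_epoch = length // BATCH_SIZE

# Assumption: the split assigns roughly 60% of the images to training.
train_images = round(NUM_IMAGES * TRAIN_FRACTION)
samples_per_train_image = length / train_images

print(length, batches_per_epoch, train_images, round(samples_per_train_image, 1))
```

Under these assumptions, the average number of samples per training image is higher than the 100 mentioned in the comment, because the sampler only draws from the training split.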

To calculate the train losses, we average the per-step values within each epoch from the generated CSV log.
But the graphs seem odd...

Doing almost the same process with RasterVision gives us a "well-behaved" loss plot. What could be the reason for that?

Thanks in advance.
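
The per-epoch averaging described above could look roughly like the sketch below. It uses only the standard library and assumes the CSV log has an `epoch` column plus a step-level `train_loss` column, with empty cells on rows where that metric was not logged. The column names and the helper `mean_per_epoch` are illustrative assumptions, not taken from the post.

```python
import csv
from collections import defaultdict


def mean_per_epoch(csv_path, metric="train_loss"):
    """Average a step-level metric over each epoch of a CSV log.

    Assumes the file has an "epoch" column and a column named `metric`,
    and that rows where the metric was not logged leave the cell empty.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            value = row.get(metric)
            if value is None or value == "":
                continue  # e.g. validation rows carry no train_loss
            epoch = int(float(row["epoch"]))
            totals[epoch] += float(value)
            counts[epoch] += 1
    # One mean per epoch, in epoch order.
    return {epoch: totals[epoch] / counts[epoch] for epoch in sorted(totals)}
```

Calling `mean_per_epoch(path, metric="val_loss")` on the same file would give the validation curve, again assuming that column name.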

adamjstewart · 2024-01-24T15:03:52Z

adamjstewart
Jan 24, 2024
Maintainer

Not an answer to your question, but I'm really curious how your course went and how you structured it. If you have time, I would love to discuss this in more detail over slack (see the invite on our README) or email (see my email on https://github.com/adamjstewart).

1 reply

lcoandrade Jan 25, 2024
Author

Thanks! Let's talk over slack then.

calebrob6 · 2024-01-24T16:02:44Z

calebrob6
Jan 24, 2024
Maintainer

Hey @lcoandrade,

It looks like the loss values are super high, so I would check some really low-level things:

To make sure inputs are being normalized correctly (divide by max val is usually fine)
To make sure that the image, mask pairs I'm passing to the model in train_step are as expected (i.e. augmentations aren't doing anything weird, the images are actually images and not all black, the masks line up)
To make sure that the learning rate is what I expect it to be
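
The first two checks above can be sketched as a small helper. This is a minimal sketch under stated assumptions: images are expected in [0, 1] after dividing by the max value, masks are expected to hold only classes 0 and 1, and the inputs are non-empty flat lists of numbers. The helper name `check_batch` and the batch layout mentioned in its docstring are hypothetical.

```python
def check_batch(image_values, mask_values, expected_classes=(0, 1)):
    """Return a list of human-readable problems found in one batch.

    Both arguments are non-empty flat iterables of numbers, e.g. obtained
    with `batch["image"].flatten().tolist()` (hypothetical batch layout).
    """
    image_values = list(image_values)
    mask_values = list(mask_values)
    problems = []

    lo, hi = min(image_values), max(image_values)
    if lo == hi:
        # A constant image (for example all black) carries no signal.
        problems.append(f"image is constant (all values == {lo})")
    if lo < 0 or hi > 1:
        problems.append(f"image range [{lo}, {hi}] is outside [0, 1]")

    unexpected = sorted(set(mask_values) - set(expected_classes))
    if unexpected:
        problems.append(f"mask has unexpected values: {unexpected}")
    return problems
```

An empty list means none of these particular checks fired; it does not confirm that images and masks line up spatially, which still needs a visual check.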

12 replies

lcoandrade Jan 25, 2024
Author

Yeah, clamp was a bad idea... Going back to the basic division by 255.

Just to point out, I wasn't dividing the images by 255 before. It was learning, but the results were not so good and the loss plot was weird. Now, I'm dividing by 255.

As I don't need to do this explicitly with RV, I simply forgot to do the normalization with TG.

adamjstewart Jan 25, 2024
Maintainer

You may actually have been dividing by 255 before. The default augmentation added to all data modules if none is specified is to divide by 255: https://github.com/microsoft/torchgeo/blob/v0.5.1/torchgeo/datamodules/geo.py#L73. So you don't need to do it with TG either. You can override the default mean and std if you need to.
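
With a mean of 0 and a std of 255, normalization reduces to dividing by 255. The sketch below only illustrates that arithmetic in plain Python (it does not call TorchGeo; the `normalize` helper is hypothetical), and shows what would happen if an explicit division by 255 were stacked on top of such a default. Whether that happened in this thread is not established.

```python
def normalize(x, mean=0.0, std=255.0):
    """Per-pixel normalization: (x - mean) / std."""
    return (x - mean) / std


raw = [0, 128, 255]                        # 8-bit pixel values
once = [normalize(v) for v in raw]         # values now in [0, 1]
twice = [normalize(v) for v in once]       # scaled a second time

print(max(once))
print(max(twice))
```

After one pass the maximum is 1; after two passes it is 1/255, so the inputs would be squeezed into a very narrow range.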

lcoandrade Jan 25, 2024
Author

So, as I'm using a CustomGeoDataModule:

class CustomGeoDataModule(GeoDataModule):
    def setup(self, stage: str) -> None:
        """Set up datasets.

        Args:
            stage: Either 'fit', 'validate', 'test', or 'predict'.
        """
        self.dataset = self.dataset_class(**self.kwargs)
        
        # Split self.dataset (built above) rather than a module-level `dataset`.
        generator = torch.Generator().manual_seed(0)
        (
            self.train_dataset,
            self.val_dataset,
            self.test_dataset,
        ) = random_bbox_assignment(self.dataset, [0.6, 0.2, 0.2], generator)
        
        if stage in ["fit"]:
            self.train_batch_sampler = RandomBatchGeoSampler(
                self.train_dataset, 
                self.patch_size, 
                self.batch_size, 
                self.length,
            )
                       
        if stage in ["fit", "validate"]:
            self.val_sampler = GridGeoSampler(
                self.val_dataset, 
                self.patch_size, 
                self.patch_size,
            )
        if stage in ["test"]:
            self.test_sampler = GridGeoSampler(
                self.test_dataset, 
                self.patch_size, 
                self.patch_size,
            )
            
datamodule = CustomGeoDataModule(
    dataset_class = type(dataset), # GeoDataModule kwargs
    batch_size = BATCH_SIZE, # GeoDataModule kwargs
    patch_size = IMG_SIZE, # GeoDataModule kwargs
    length = MAX_WINDOWS*180, # GeoDataModule kwargs
    num_workers = WORKERS, # GeoDataModule kwargs
    dataset1 = image_set, # IntersectionDataset kwargs
    dataset2 = gt_set, # IntersectionDataset kwargs
    collate_fn = stack_samples, # IntersectionDataset kwargs
)

I don't need to explicitly do this:

class ImageTransformer:
    def __call__(self, sample):
        x = sample["image"]
        x = x/255.
        sample["image"] = x
        return sample

image_set = MyRasterImage(
    paths=os.path.join(INPUT_DIR, TRAIN_DIR, IMG_DIR),
    transforms=ImageTransformer(),
)

class ReclassTransformer:
    def __call__(self, sample):
        x = sample["mask"]
        x[x == 255] = 1
        sample["mask"] = x
        return sample

gt_set = MyRasterMask(
    paths=os.path.join(INPUT_DIR, TRAIN_DIR, LABEL_DIR),
    transforms=ReclassTransformer(),
)

So the code was OK before... Right? Why were the losses so high?

adamjstewart Jan 25, 2024
Maintainer

Yes, the above code without normalization looks correct. Maybe try the other steps @calebrob6 suggested, including plotting the image/mask to make sure they look right.

Answer selected by lcoandrade

lcoandrade Jan 25, 2024
Author

I'm plotting the pairs. They are fine.

The predictions also look OK, considering the number of epochs.

Plotting with seaborn relplot, I'm getting this (just 5 epochs for the sake of time):

adamjstewart Jan 26, 2024
Maintainer

Before you were getting 1e9, now you're getting 1e-7? That seems much better to me.

calebrob6 Jan 26, 2024
Maintainer

+1, I wouldn't be too concerned if I saw a plot like the above. What do the predictions of the model look like after 5 epochs? And what happens if you use a smaller LR?

lcoandrade Jan 26, 2024
Author

Yeah. In fact, it is much better in this run.

I don't know if it is related, but this last 5-epoch run didn't generate any "ERROR 1: Point outside of projection domain" errors.
