How to adapt validation_run from Ch10 for Categorical DQN? #54

Open
PeterSenyszyn opened this issue Feb 3, 2022 · 0 comments

PeterSenyszyn commented Feb 3, 2022

Hello, I'm trying to adapt the examples from Ch 8 & 10 of the book into a Double-Dueling Categorical architecture, using the Conv1d model from Ch 10. Training seems to work fine using ptan and PyTorch Ignite. I also want to run validation with OpenAI Gym, so I was wondering how to determine the next action for a new observation batch.

My understanding is that for the normal dueling/double Q Conv1d network, we run a forward pass of the observation through the trained network to get the Q-values, then take the argmax to find action_idx. For the categorical architecture, however, the book states that a forward pass "returns the predicted probability distribution as a 3D tensor (batch, actions, and supports)." For a bar size of 10, I can clearly see that my output is a tensor of shape (1, 3, 51), but the values along dim=1 look like various weights, not integers.

What additional steps do I need to take to get the next action to pass to the OpenAI Gym environment? Thanks in advance, and I'm happy to post more code if needed.

My model:

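import numpy as np
import torch
import torch.nn as nn

# NoisyFactorizedLinear, Vmin, Vmax, DELTA_Z and N_ATOMS are not defined in this snippet.

class PlatformDQNDistr(nn.Module):
    def __init__(self, input_shape, n_actions):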
        super(PlatformDQNDistr, self).__init__()

        self.conv = nn.Sequential(
            nn.Conv1d(input_shape[0], 128, 5),
            nn.ReLU(),
            nn.Conv1d(128, 128, 5),
            nn.ReLU(),
        )
        conv_out_size = self._get_conv_out(input_shape)

        # We use Noisy networks rather than epsilon greedy action selection for exploration
        self.fc_val = nn.Sequential(
            NoisyFactorizedLinear(conv_out_size, 512),
            nn.ReLU(),
            NoisyFactorizedLinear(512, 1)
        )

        self.fc_adv = nn.Sequential(
            NoisyFactorizedLinear(conv_out_size, 512),
            nn.ReLU(),
            NoisyFactorizedLinear(512, n_actions * N_ATOMS)
        )
        sups = torch.arange(Vmin, Vmax + DELTA_Z, DELTA_Z)
        self.register_buffer("supports", sups)
        self.softmax = nn.Softmax(dim=1)

    def _get_conv_out(self, shape):
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

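    def forward(self, x):
        batch_size = x.size()[0]
        conv_out = self.conv(x).view(batch_size, -1)  # flatten conv features per sample
        val = self.fc_val(conv_out)  # (batch, 1)
        adv = self.fc_adv(conv_out)  # (batch, n_actions * N_ATOMS)
        # Dueling combination; returns raw logits of shape (batch, n_actions, N_ATOMS)
        return (val + adv - adv.mean(dim=1, keepdim=True)).view(batch_size, -1, N_ATOMS)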

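    def both(self, x):
        cat_out = self(x)  # raw logits, (batch, n_actions, N_ATOMS)
        probs_distribution = self.apply_softmax(cat_out)  # softmax over the atoms
        weights = probs_distribution * self.supports  # weight each atom by its support value
        res = weights.sum(dim=2)  # expected value per action, (batch, n_actions)
        return cat_out, res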

    def q_vals(self, x):
        return self.both(x)[1]

    def apply_softmax(self, t):
        return self.softmax(t.view(-1, N_ATOMS)).view(t.size())
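
For reference, a minimal sketch of one way the greedy action could be read off a (batch, actions, atoms) output: apply softmax over the atoms, take the expectation against the supports to get one Q-value per action, then argmax over actions. The function name `greedy_actions` and the tensor values below are illustrative assumptions and are not from the book or the repository.

```python
import torch


def greedy_actions(logits: torch.Tensor, supports: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: pick greedy actions from categorical-DQN logits.

    logits:   (batch, n_actions, n_atoms) raw network output
    supports: (n_atoms,) value assigned to each atom
    returns:  (batch,) tensor of integer action indices
    """
    probs = torch.softmax(logits, dim=2)      # probability of each atom, per action
    q_vals = (probs * supports).sum(dim=2)    # expected return per action
    return q_vals.argmax(dim=1)               # best action for each batch item


# Illustrative values only: 3 actions, 51 atoms spanning [-10, 10]
n_atoms = 51
supports = torch.linspace(-10.0, 10.0, n_atoms)
logits = torch.zeros(1, 3, n_atoms)           # actions 0 and 1: uniform distribution
logits[0, 2, -1] = 20.0                       # action 2: mass concentrated on the top atom

with torch.no_grad():
    action_idx = greedy_actions(logits, supports)[0].item()
```

With the model above, `q_vals(x)` already appears to compute the same (batch, actions) expectation, so something like `net.q_vals(obs_v).argmax(dim=1)` under `torch.no_grad()` would presumably yield the integer action to pass to the environment's `step()`.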