The usual way to sample from a categorical distribution is `np.random.choice`, but OpenAI's implementation was doing it a completely different way. Looking at utils.py, we have:

```
def sample(logits):
    noise = tf.random_uniform(tf.shape(logits))
    return tf.argmax(logits - tf.log(-tf.log(noise)), 1)
```

This *sort of* makes sense. If you just take the argmax of the logits, you would always sample the highest-probability action. So instead, you add some noise to the logits to make things more random, and *then* take the argmax. But why not just use `noise` on its own? Why `-tf.log(-tf.log(noise))`?

There's another strange thing. Usually we sample from the distribution created by passing the logits through a softmax function. If we sample using noise on only the logits, does it give the same results?

It turns out this is a clever way of sampling *directly* from the softmax distribution using noise from a special distribution: the *Gumbel* distribution. Let's explore what this humble distribution is all about.

(If you'd like to follow along with this notebook interactively, sources can be found at GitHub.)

The Gumbel distribution is a probability distribution with density function

$$p(x) = \frac{1}{\beta} \exp(-z - \exp(-z))$$ where $$z = \frac{x - \mu}{\beta}.$$
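As a quick sanity check on this formula, we can compare it against SciPy's `gumbel_r` distribution (the right-skewed Gumbel, which uses the same loc/scale parameterization as $\mu$ and $\beta$ above):

```
import numpy as np
from scipy.stats import gumbel_r

def gumbel_pdf(x, loc=0.0, scale=1.0):
    z = (x - loc) / scale
    return np.exp(-z - np.exp(-z)) / scale

# Evaluate our hand-written density and SciPy's at a few points
x = np.linspace(-3, 6, 50)
assert np.allclose(gumbel_pdf(x, loc=1.0, scale=2.0),
                   gumbel_r.pdf(x, loc=1.0, scale=2.0))
```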

On its own, the Gumbel distribution is typically used to model the maximum of a set of independent samples. For example, let's say you want to quantify how much ice cream you eat per day. Assume your hunger for ice cream is normally-distributed, with a mean of 5/10. You record your hunger 100 times a day for 10,000 days. (We also assume your hunger is erratic enough that all samples are independent.) You make a note of the maximum hunger you experience every day.

The distribution of daily maximum hunger would then follow a Gumbel distribution.

```
import numpy as np
from matplotlib.pyplot import *
from scipy.optimize import curve_fit

mean_hunger = 5
samples_per_day = 100
n_days = 10000
samples = np.random.normal(loc=mean_hunger, size=(n_days, samples_per_day))
daily_maxes = np.max(samples, axis=1)

def gumbel_pdf(prob, loc, scale):
    z = (prob - loc) / scale
    return np.exp(-z - np.exp(-z)) / scale

def plot_maxes(daily_maxes):
    # Histogram of daily maxima, normalized to a density
    probs, hungers, _ = hist(daily_maxes, density=True, bins=100)
    xlabel("Hunger")
    ylabel("Probability of hunger being daily maximum")
    # Fit a Gumbel PDF to the (bin left edge, density) pairs
    (loc, scale), _ = curve_fit(gumbel_pdf, hungers[:-1], probs)
    plot(hungers, gumbel_pdf(hungers, loc, scale))

figure()
plot_maxes(daily_maxes)
```

In theory[1], the Gumbel distribution should be a good fit when the underlying data is distributed according to either a normal or an exponential distribution. To convince ourselves, let's try again with exponentially-distributed hunger:

```
most_likely_hunger = 5
samples = np.random.exponential(scale=most_likely_hunger,
                                size=(n_days, samples_per_day))
daily_maxes = np.max(samples, axis=1)

figure()
plot_maxes(daily_maxes)
```

Sure enough, the distribution of maximum daily hunger values is still Gumbel-distributed.

**Edit:** Varun Sudarsanan points out that for exponential distributions, the Gumbel parameters need to be adjusted for the sample size. The effect is small here because each day has only 100 samples, but it grows with the sample size. Thanks, Varun!
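To see the sample-size dependence concretely: the maximum of $n$ i.i.d. exponential samples with scale $\beta$ is approximately Gumbel-distributed with scale $\beta$ and location $\beta \ln n$, so the location drifts upward as $n$ grows. A quick sketch, estimating the location from the Gumbel mean formula (mean = loc + γ · scale, with γ the Euler–Mascheroni constant):

```
import numpy as np

rng = np.random.default_rng(0)
beta = 5.0     # exponential scale parameter
n_reps = 1000  # how many maxima we collect per sample size

# As the number of samples n grows, the fitted Gumbel location
# should track beta * ln(n).
for n in [100, 1000, 10000]:
    maxes = rng.exponential(scale=beta, size=(n_reps, n)).max(axis=1)
    # Gumbel mean = loc + Euler-Mascheroni constant * scale
    loc_est = maxes.mean() - beta * np.euler_gamma
    print(n, round(loc_est, 2), round(beta * np.log(n), 2))
```

The two printed columns should agree closely for each `n`, and both grow logarithmically with the sample size.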

What does the Gumbel distribution have to do with sampling from a categorical distribution?

To experiment, let's set up a distribution to work with.

```
n_cats = 7
cats = np.arange(n_cats)
probs = np.random.randint(low=1, high=20, size=n_cats)
probs = probs / sum(probs)
logits = np.log(probs)

def plot_probs():
    bar(cats, probs)
    xlabel("Category")
    ylabel("Probability")

figure()
plot_probs()
```

As a sanity check, let's try sampling with `np.random.choice`. Do we get the right probabilities?

```
n_samples = 1000

def plot_estimated_probs(samples):
    n_cats = np.max(samples) + 1
    estd_probs, _, _ = hist(samples,
                            bins=np.arange(n_cats + 1),
                            align='left',
                            edgecolor='white',
                            density=True)
    xlabel("Category")
    ylabel("Estimated probability")
    return estd_probs

def print_probs(probs):
    print(" ".join(["{:.2f}"] * len(probs)).format(*probs))

samples = np.random.choice(cats, p=probs, size=n_samples)

figure()
subplot(1, 2, 1)
plot_probs()
subplot(1, 2, 2)
estd_probs = plot_estimated_probs(samples)
tight_layout()

print("Original probabilities:\t\t", end="")
print_probs(probs)
print("Estimated probabilities:\t", end="")
print_probs(estd_probs)
```

Back to OpenAI's mysterious sample function:

```
def sample(logits):
    noise = tf.random_uniform(tf.shape(logits))
    return tf.argmax(logits - tf.log(-tf.log(noise)), 1)
```

We had some intuition that maybe this works by just using some noise to "shake up" the argmax.

Will any noise do? Let's try a couple of different types.

```
def sample(logits):
    noise = np.random.uniform(size=len(logits))
    sample = np.argmax(logits + noise)
    return sample

samples = [sample(logits) for _ in range(n_samples)]

figure()
subplot(1, 2, 1)
plot_probs()
subplot(1, 2, 2)
estd_probs = plot_estimated_probs(samples)
tight_layout()

print("Original probabilities:\t\t", end="")
print_probs(probs)
print("Estimated probabilities:\t", end="")
print_probs(estd_probs)
```

So uniform noise captures the modes of the distribution, but in distorted proportions, and it misses the other categories entirely. That makes sense: uniform noise is bounded in [0, 1), so any category whose logit is more than 1 below the largest logit can never win the argmax.

```
def sample(logits):
    noise = np.random.normal(size=len(logits))
    sample = np.argmax(logits + noise)
    return sample

samples = [sample(logits) for _ in range(n_samples)]

figure()
subplot(1, 2, 1)
plot_probs()
subplot(1, 2, 2)
estd_probs = plot_estimated_probs(samples)
tight_layout()

print("Original probabilities:\t\t", end="")
print_probs(probs)
print("Estimated probabilities:\t", end="")
print_probs(estd_probs)
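For contrast, here's a sketch using NumPy's `np.random.gumbel` in place of uniform or normal noise, on a small made-up distribution. Adding standard Gumbel noise to the logits and taking the argmax reproduces the softmax probabilities in expectation:

```
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.1, 0.6, 0.3])  # example categorical distribution
logits = np.log(probs)

# Gumbel-max trick: argmax(logits + standard Gumbel noise) is a sample
# from softmax(logits)
n_samples = 100_000
noise = rng.gumbel(size=(n_samples, len(logits)))
samples = np.argmax(logits + noise, axis=1)
estd_probs = np.bincount(samples, minlength=len(probs)) / n_samples
print(estd_probs)  # close to [0.1, 0.6, 0.3]
```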