Sigmoid is Softmax (specialized to 0)
Today, I discovered that the sigmoid function is equivalent to softmax when applied to two inputs, with one input fixed to zero. With almost a decade of experience in machine learning, and earlier work in compute for signal processing, I had never considered this connection until now.
Sigmoid and softmax are two common mathematical functions in machine learning, especially in classification tasks. Applying sigmoid to an input that has been shifted by a constant is equivalent to applying softmax to a list of two inputs where the second input is fixed to that constant. If the fixed input is zero, no shift is needed.
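One way to see this, writing c for the fixed second input (a quick sketch of the two-input case, using the same exp notation as the code below):

softmax([x, c])[0] = exp(x) / (exp(x) + exp(c))
                   = 1 / (1 + exp(-(x - c)))
                   = sigmoid(x - c)

With c = 0, the right-hand side is exactly sigmoid(x).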
Sigmoid squashes any real number into a value between 0 and 1. You can think of it as a smooth version of a binary threshold.
Softmax generalizes this idea to multiple values. It turns a list of numbers into probabilities that sum to 1, emphasizing the largest numbers more.
Sigmoid and Softmax in Python
Here are the two formulas:
import numpy as np
# Sigmoid function
def sigmoid(x): return (1 / (1 + np.exp(-x)))
# Softmax over a list of inputs
def softmax(inputs): return np.exp(inputs) / np.sum(np.exp(inputs))
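A quick numerical sanity check, reusing the two definitions above (the grid of test points and the constant 1.7 are arbitrary choices for illustration):

# Check sigmoid(x) == softmax([x, 0])[0] across a range of inputs
xs = np.linspace(-5.0, 5.0, 11)
assert all(np.isclose(sigmoid(x), softmax([x, 0.0])[0]) for x in xs)
# More generally, shifting sigmoid's input by c matches fixing the second softmax input to c
c = 1.7
assert all(np.isclose(sigmoid(x - c), softmax([x, c])[0]) for x in xs)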
Sigmoid is just softmax over the original argument and zero:
def sigmoid(x): return softmax([x, 0])[0]
Interestingly, softmax has roots in physics. It comes from ideas related to entropy and the Boltzmann distribution, which describes the probability of particles being in certain energy states. Machine learning borrows this concept to model probabilities over choices.
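Concretely, the Boltzmann distribution gives a state with energy E_i the probability (this is the standard textbook form, written in the same exp notation as above):

p_i = exp(-E_i / (k*T)) / sum_j exp(-E_j / (k*T))

which is just softmax applied to the scaled negative energies, where k is Boltzmann's constant and T is the temperature. With the code above, that would be softmax(-E / (k * T)) for an array of energies E.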
Back in the day, we used relu as the non-linear function between layers, and I never gave much thought to the shift toward sigmoid-based ML. My job was to compile these functions to efficient code, and sigmoid is more challenging to compile than relu. Now, though, I see there is a deeper reason why sigmoid makes sense here.