Unlike light-tailed settings, the sample mean is not well-behaved in heavy-tailed settings. Since heavy-tailed distributions may not have finite MGFs, the Chernoff method is not applicable. Catoni gives an example demonstrating that the bound achieved via Markov's inequality (basic inequalities), i.e.,

$$\Pr\left(|\hat{\mu}_n - \mu| \geq \sigma\sqrt{\frac{1}{n\delta}}\right) \leq \delta,$$

is essentially tight (where we receive observations $X_1, \dots, X_n$ with mean $\mu$ and variance $\sigma^2$, and $\hat{\mu}_n = \frac{1}{n}\sum_{i=1}^n X_i$ is the sample mean). The issue is that outliers can have devastating effects on the sample mean, and heavy-tailed distributions can have many extreme observations. See this discussion by Lugosi and Mendelson for more details.

Ideally we want estimators with sub-Gaussian-like behavior, i.e., estimators $\hat{\mu}_n$ such that, with probability at least $1 - \delta$,

$$|\hat{\mu}_n - \mu| \lesssim \sigma\sqrt{\frac{\log(1/\delta)}{n}}.$$

This is an exponential improvement in the dependence on $1/\delta$. These are called sub-Gaussian estimators.
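
To see the gap concretely, here is a quick numerical comparison of the two deviation widths at a fixed sample size (the constants are illustrative):

```python
import math

n, sigma = 10_000, 1.0

for delta in (1e-2, 1e-4, 1e-6):
    markov = sigma * math.sqrt(1 / (n * delta))                # polynomial in 1/delta
    subgauss = sigma * math.sqrt(2 * math.log(1 / delta) / n)  # logarithmic in 1/delta
    print(f"delta={delta:.0e}  Markov width: {markov:8.3f}  sub-Gaussian width: {subgauss:.4f}")
```

At $\delta = 10^{-6}$ and $n = 10{,}000$, the Markov-style width is $10\sigma$ while the sub-Gaussian width is about $0.05\sigma$.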

Median-of-means (MoM)

The idea is simple: Split the data into $k$ buckets, compute the sample mean $\hat{\mu}_j$ of each bucket, and return the median of these bucket means, $\hat{\mu}_{\text{MoM}} = \mathrm{median}(\hat{\mu}_1, \dots, \hat{\mu}_k)$. With probability at least $3/4$ by Chebyshev, each $\hat{\mu}_j$ is not too far from the true mean (within $2\sigma\sqrt{k/n}$, since each bucket contains $n/k$ points). So for the median to be far from the mean, many (at least half) independent Bernoulli events with probability 3/4 must fail to occur.

If we choose $k = \lceil 8\log(1/\delta) \rceil$, then the median-of-means estimator satisfies, with probability at least $1 - \delta$,

$$|\hat{\mu}_{\text{MoM}} - \mu| \leq \sigma\sqrt{\frac{32\log(1/\delta)}{n}}.$$
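
A minimal sketch of the estimator in Python, with the bucket count matching the choice above (the random shuffle and the tie-handling of `np.median` are incidental implementation choices):

```python
import numpy as np

def median_of_means(x, delta, rng=None):
    """Split x into k = ceil(8 log(1/delta)) buckets, average each bucket,
    and return the median of the bucket means."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    k = max(1, min(len(x), int(np.ceil(8 * np.log(1 / delta)))))
    buckets = np.array_split(rng.permutation(x), k)
    return float(np.median([b.mean() for b in buckets]))

# Heavy-tailed sanity check: Pareto (Lomax) samples, whose true mean is 1/(a-1).
rng = np.random.default_rng(0)
a = 2.5
x = rng.pareto(a, size=10_000)
print(median_of_means(x, delta=1e-4, rng=rng), "vs true mean", 1 / (a - 1))
```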

The MoM estimator can also be extended to situations where the distribution has no variance. If $\mathbb{E}|X - \mu|^{1+\varepsilon} \leq v$ for some $\varepsilon \in (0, 1]$, then with probability at least $1 - \delta$,

$$|\hat{\mu}_{\text{MoM}} - \mu| \lesssim v^{1/(1+\varepsilon)} \left(\frac{\log(1/\delta)}{n}\right)^{\varepsilon/(1+\varepsilon)}.$$

This was originally proved by Bubeck, Cesa-Bianchi, and Lugosi in the context of heavy-tailed bandits. I have a post with the proofs here.

You can sequentialize the MoM estimator using the Dubins-Savage inequality. I have a post about that here.

Trimmed mean estimator

Since the empirical mean is ruined by outliers, a natural idea is to remove some fraction of the points and then compute the empirical mean on the remaining points. These are known as trimmed mean estimators.

Oliveira and Orenstein prove that if you remove an $\varepsilon$ fraction of the largest and smallest points for $\varepsilon \asymp \log(1/\delta)/n$, then the trimmed mean is sub-Gaussian. See here for a discussion.
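
The precise trimming level and sample-splitting devices differ across analyses (Oliveira and Orenstein's construction is more careful than this), but the basic recipe looks like the following one-sample sketch, which drops $m \approx \varepsilon n \asymp \log(1/\delta)$ points from each tail; the constant `c` is illustrative:

```python
import numpy as np

def trimmed_mean(x, delta, c=1.0):
    """Drop the m largest and m smallest points, m = ceil(c * log(1/delta)),
    and average what remains."""
    x = np.sort(np.asarray(x, dtype=float))
    m = int(np.ceil(c * np.log(1 / delta)))
    if 2 * m >= len(x):
        raise ValueError("too few samples for this confidence level")
    return float(x[m:len(x) - m].mean())
```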

A related idea is proposed by Lee and Valiant, who don't remove entire points but instead weighted fractions of each observation:

At the highest level: in order to return a $\delta$-robust estimate of the mean, our estimator “throws out the most extreme points in the samples”, and returns the mean of what remains. More specifically, outliers are thrown out in a weighted manner, where we throw out a fraction of each data point, with the fraction proportional to the square of its distance from a median-of-means initial guess for the mean, where the fraction is capped at 1, and the proportionality constant is chosen so that the total weight thrown out equals exactly $\frac{1}{3}\log\frac{1}{\delta}$.

Their estimator achieves a bound of $|\hat{\mu} - \mu| \leq (1 + o(1))\,\sigma\sqrt{\frac{2\log(1/\delta)}{n}}$, i.e., sub-Gaussian performance with the optimal leading constant.
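
Here is a rough Python sketch of that weighted-trimming procedure as described in the quote above. The total-weight budget (taken as $\frac{1}{3}\log\frac{1}{\delta}$ to match the quote) and the use of bisection to find the proportionality constant are my own assumptions, not a faithful transcription of their algorithm:

```python
import numpy as np

def weighted_trimmed_mean(x, delta, initial_guess):
    """Throw out a fraction eps_i = min(1, lam * (x_i - guess)^2) of each point,
    with lam chosen so the total removed weight hits a log(1/delta) budget,
    then return the mean of what remains."""
    x = np.asarray(x, dtype=float)
    d2 = (x - initial_guess) ** 2
    budget = np.log(1 / delta) / 3  # assumed budget; see the quoted description

    removable = int((d2 > 0).sum())
    if removable == 0 or budget >= removable:
        return float(initial_guess)  # degenerate cases

    # Bisect for lam such that sum_i min(1, lam * d2_i) == budget
    # (the left side is monotone increasing in lam).
    lo, hi = 0.0, 1.0
    while np.minimum(1.0, hi * d2).sum() < budget:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if np.minimum(1.0, mid * d2).sum() < budget:
            lo = mid
        else:
            hi = mid
    eps = np.minimum(1.0, hi * d2)  # fraction of each point thrown out

    # Weighted mean of what remains, anchored at the initial guess.
    return float(initial_guess + ((x - initial_guess) * (1.0 - eps)).sum() / len(x))

# Usage: seed the estimator with a median-of-means initial guess
# (median_of_means as in the MoM sketch above):
# mu_hat = weighted_trimmed_mean(x, 1e-4, median_of_means(x, 1e-4))
```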

Catoni-Giulini M-estimator