A Pandas surprise - NaNs with groupby

Python

Author: Simon R. Steinkamp
Published: April 28, 2020


import pandas as pd
import numpy as np
from pandas.errors import UnsupportedFunctionCall

I figured out something about pandas today that surprised me a lot. Aggregating a pd.DataFrame after .groupby automatically ignores NaN values. This is intended behavior, but sometimes you actually want NaN to propagate into the result, e.g. to check whether your DataFrame is correct and to find possible corruptions.

Here is a little example:

# Create a sample DataFrame:
DF = pd.DataFrame.from_dict({'g1': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'], 
                             'g2': ['c', 'c', 'd', 'd', 'c', 'c', 'd', 'd'],
                             'd1': [0, 1, np.nan, 3, 4, 5, 6, 7]})

Averaging the entries in DF, we would expect NaN for group (a, d), but we get 3.0!

DF.groupby(['g1', 'g2']).mean()
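The surprising result can be checked directly (re-creating DF so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

DF = pd.DataFrame({'g1': list('aaaabbbb'),
                   'g2': list('ccddccdd'),
                   'd1': [0, 1, np.nan, 3, 4, 5, 6, 7]})

out = DF.groupby(['g1', 'g2']).mean()
print(out)
# Group (a, d) contains [NaN, 3], yet its mean is reported as 3.0:
# the NaN was silently skipped.
```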

If you apply pandas' .mean() method directly to a DataFrame or Series, you can pass skipna=False. Unfortunately, this doesn't work after .groupby.
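For comparison, skipna=False does behave as hoped without groupby (DF re-created here so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

DF = pd.DataFrame({'g1': list('aaaabbbb'),
                   'g2': list('ccddccdd'),
                   'd1': [0, 1, np.nan, 3, 4, 5, 6, 7]})

# Default skipna=True ignores the NaN:
print(DF['d1'].mean())              # mean of the 7 non-NaN values
# skipna=False propagates it:
print(DF['d1'].mean(skipna=False))  # -> nan
```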

# This creates an error
try:
    DF.groupby(['g1', 'g2']).mean(skipna=False)
except UnsupportedFunctionCall:
    print('UnsupportedFunctionCall')

I think I have seen a suggested solution for this issue, stating that using .apply(np.mean) instead of .mean() might solve the problem.

However:

DF.groupby(['g1', 'g2']).apply(np.mean)

Calling np.mean on a pandas object doesn't compute the mean itself: NumPy delegates to the object's own .mean() method, which again uses skipna=True! As far as I know, you have to write a small wrapper function to solve the issue.
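The delegation is easy to demonstrate on a small Series: np.mean of the Series falls back to Series.mean (skipna=True), while np.mean of the underlying ndarray propagates the NaN.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# np.mean delegates to s.mean(), which skips the NaN:
print(np.mean(s))             # -> 2.0
# On a plain ndarray, the NaN propagates:
print(np.mean(s.to_numpy()))  # -> nan
```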

def mean_w_nan(x):
    # Don't forget the np.array call - it stops np.mean
    # from delegating back to pandas' skipna=True mean!
    return np.mean(np.array(x))

# Select the numeric column first, so the string-valued
# group columns don't end up in the array:
DF.groupby(['g1', 'g2'])['d1'].apply(mean_w_nan)
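An alternative, sketched here as my own suggestion rather than something from the post above, is to skip the NumPy wrapper entirely and call Series.mean with skipna=False inside .apply (DF re-created so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

DF = pd.DataFrame({'g1': list('aaaabbbb'),
                   'g2': list('ccddccdd'),
                   'd1': [0, 1, np.nan, 3, 4, 5, 6, 7]})

# Series.mean still accepts skipna, so apply it per group:
result = DF.groupby(['g1', 'g2'])['d1'].apply(lambda s: s.mean(skipna=False))
print(result)
# Group (a, d) is now NaN instead of 3.0.
```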
