A Pandas surprise - NaNs and groupby

import pandas as pd
import numpy as np
from pandas.errors import UnsupportedFunctionCall
I figured out something about pandas today that surprised me quite a bit. After applying .groupby to a pd.DataFrame, the aggregation methods automatically skip NaN values. This is intended behavior, but sometimes you actually want NaNs to propagate, e.g. to check whether your DataFrame is correct and to find possible corruptions.
Here is a little example:
# Create a sample DataFrame:
DF = pd.DataFrame.from_dict({'g1': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                             'g2': ['c', 'c', 'd', 'd', 'c', 'c', 'd', 'd'],
                             'd1': [0, 1, np.nan, 3, 4, 5, 6, 7]})

Averaging the entries in DF, we would expect a NaN in group (a, d), but we get 3.0!
# Group (a, d) contains [NaN, 3], yet its mean comes out as 3.0:
DF.groupby(['g1', 'g2']).mean()

If you call pandas' .mean() method directly on a DataFrame or Series, you can pass skipna=False to keep the NaNs (a quick demonstration follows below). Unfortunately, this doesn't work after using .groupby.
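Just to confirm the default behavior on the raw column (a quick check, not part of the original example):

# On a plain Series, skipna is available and does what you'd expect:
DF['d1'].mean()              # approx. 3.714 -- the NaN is skipped by default
DF['d1'].mean(skipna=False)  # nan -- the NaN propagates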
# After .groupby, the same keyword raises an error:
try:
    DF.groupby(['g1', 'g2']).mean(skipna=False)
except UnsupportedFunctionCall:
    print('UnsupportedFunctionCall')

I think I have seen a suggested solution for this issue, stating that using .apply(np.mean) instead of .mean() might solve the problem.
However:
# Select the data column so only numeric values reach np.mean:
DF.groupby(['g1', 'g2'])['d1'].apply(np.mean)

Calling np.mean causes pandas to bypass numpy's function entirely: the call is dispatched to the group's own .mean() method, which again runs with skipna=True! You can see this dispatch in isolation below; as far as I know, the way out is to write a small wrapper function (shown after that).
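Here is the dispatch reduced to a minimal example (my illustration, not from the original post):

# numpy hands the call back to pandas when given a Series:
s = pd.Series([1.0, np.nan])
np.mean(s)             # 1.0 -- dispatched to Series.mean (skipna=True)
np.mean(s.to_numpy())  # nan -- plain ndarray, the NaN propagates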
def mean_w_nan(x):
    # Don't forget the np.array call! Converting to a plain ndarray
    # prevents the dispatch to pandas' NaN-skipping .mean().
    return np.mean(np.array(x))

# Again, select the numeric column before applying the wrapper:
DF.groupby(['g1', 'g2'])['d1'].apply(mean_w_nan)
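As an aside, another pattern that should work (my addition, not from the original post) is a lambda that calls the pandas method with skipna=False explicitly; the lambda is not intercepted, so the keyword actually reaches Series.mean:

# Alternative workaround: pass skipna=False inside apply
DF.groupby(['g1', 'g2'])['d1'].apply(lambda s: s.mean(skipna=False))
# Group (a, d) now correctly comes out as NaN.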