A Pandas surprise - NaNs and groupby
My summary
import pandas as pd
import numpy as np
from pandas.errors import UnsupportedFunctionCall
I figured out something about pandas today, which I was very surprised by.
Applying .groupby on a pd.DataFrame and then aggregating automatically ignores NaN values. This is intended behavior, but sometimes you actually want a NaN to show up in the result, so you can check whether your data frame is correct and find possible corruption.
Here is a little example:
DF = pd.DataFrame.from_dict({'g1': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
'g2': ['c', 'c', 'd', 'd', 'c', 'c', 'd', 'd'],
'd1': [0, 1, np.nan, 3, 4, 5, 6, 7]})
Averaging the entries in DF by group, we would expect a NaN in group (a, d), but we get 3.0!
DF.groupby(['g1', 'g2']).mean()
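Since the averaged table no longer shows where values were missing, one way to find the affected groups is to flag NaNs before grouping. This is only a minimal sketch on the same example data; the names `missing` and `has_nan` are made up for illustration.

```python
import numpy as np
import pandas as pd

DF = pd.DataFrame({'g1': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   'g2': ['c', 'c', 'd', 'd', 'c', 'c', 'd', 'd'],
                   'd1': [0, 1, np.nan, 3, 4, 5, 6, 7]})

# Mark each row whose 'd1' is missing ('missing' is a hypothetical helper column),
# then ask, per group, whether any row was marked.
has_nan = (DF.assign(missing=DF['d1'].isna())
             .groupby(['g1', 'g2'])['missing']
             .any())
print(has_nan)
```

Only the group (a, d) should be flagged, which is exactly the group whose mean silently dropped a value.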
If you apply pandas' .mean() method to a DataFrame, you can specify skipna=False in the call. Unfortunately, this doesn't work after using .groupby.
try:
    DF.groupby(['g1', 'g2']).mean(skipna=False)
except UnsupportedFunctionCall:
    print('UnsupportedFunctionCall')
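For comparison, here is how skipna behaves outside of .groupby, on the plain column of the example data. This is a small sketch; the variable names `default_mean` and `strict_mean` are only illustrative.

```python
import numpy as np
import pandas as pd

DF = pd.DataFrame({'g1': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   'g2': ['c', 'c', 'd', 'd', 'c', 'c', 'd', 'd'],
                   'd1': [0, 1, np.nan, 3, 4, 5, 6, 7]})

# Default: the NaN is skipped and the mean is taken over the 7 remaining values.
default_mean = DF['d1'].mean()

# With skipna=False a single NaN makes the whole result NaN.
strict_mean = DF['d1'].mean(skipna=False)

print(default_mean, strict_mean)
```

The second call gives the behavior we would like to have per group.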
I think I have seen one suggested solution stating that using .apply(np.mean) instead of .mean() might solve the problem.
However:
DF.groupby(['g1', 'g2']).apply(np.mean)
Passing np.mean does not help: pandas recognizes the function, bypasses it, and calls its own .mean() with skipna=True!
As far as I know, you have to create a new function to solve the issue.
def mean_w_nan(x):
    # Don't forget the np.array call!
    return np.mean(np.array(x))
DF.groupby(['g1', 'g2']).apply(mean_w_nan)
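A possible alternative, sketched here under the assumption that only the numeric column d1 needs to be averaged: select that column first and aggregate with a small lambda that calls the Series' own .mean(skipna=False). Selecting the column also keeps the string grouping columns out of the calculation. The name `result` is just for illustration.

```python
import numpy as np
import pandas as pd

DF = pd.DataFrame({'g1': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   'g2': ['c', 'c', 'd', 'd', 'c', 'c', 'd', 'd'],
                   'd1': [0, 1, np.nan, 3, 4, 5, 6, 7]})

# Each group of 'd1' arrives as a pd.Series, so skipna=False is available again.
result = DF.groupby(['g1', 'g2'])['d1'].agg(lambda s: s.mean(skipna=False))
print(result)
```

Here the group (a, d) should come out as NaN, while the other groups keep their ordinary means.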