Data Analysis
pandas basic 02
jaehwi0823
2019. 10. 10. 20:08
01. Column Selection
There are a few ways to select columns in a DataFrame.
# select by indexing
mydf[['col1', 'col2', 'col3']]
# select by dtype
mydf.select_dtypes(include=['int'])
# select by filter
mydf.filter(like='flag')
mydf.filter(regex=r'\d')  # raw string; matches column names containing a digit
mydf.filter(like='pre_')
# ex1) select object columns that contain NaN
null_cols = (mydf.select_dtypes(['object']).isnull().sum()>0).to_list()
mydf.select_dtypes(['object']).loc[:, null_cols]
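The selection methods above can be tried on a small frame. This is a minimal sketch: the sample data and column names (`pre_score`, `flag_a`, `name`, `ratio1`) are made up for illustration, and the string columns are created with an explicit object dtype so that `select_dtypes(['object'])` picks them up.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; string columns are forced to object dtype.
mydf = pd.DataFrame({
    "pre_score": [1, 2, 3],
    "flag_a": pd.Series(["y", None, "n"], dtype=object),
    "name": pd.Series(["kim", "lee", "park"], dtype=object),
    "ratio1": [0.1, 0.2, np.nan],
})

by_dtype = mydf.select_dtypes(include=["number"])  # all numeric columns
by_like = mydf.filter(like="flag")                 # names containing "flag"
by_regex = mydf.filter(regex=r"\d")                # names containing a digit

# Object columns with at least one missing value.
# isnull().any() is an alternative to (isnull().sum() > 0).
obj = mydf.select_dtypes(include=["object"])
obj_with_nan = obj.loc[:, obj.isnull().any()]
```

Note that `filter` matches on column names, not on the values inside the columns.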
It is useful to group related columns into lists and reorder the DataFrame by those groups.
col_A = ['col_A1', 'col_A2', 'col_A3']
col_B = ['col_B1', 'col_B2']
col_C = ['col_C1', 'col_C2', 'col_C3', 'col_C4']
mydf = mydf[col_A + col_B + col_C]
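A small sketch of the reordering, with hypothetical column names. The set comparison is an extra safeguard that is not in the original: indexing with a list that misses a column drops that column silently.

```python
import pandas as pd

# Hypothetical frame whose columns arrive in an arbitrary order.
mydf = pd.DataFrame({"price": [1], "id": [10], "name": ["a"], "qty": [2]})

id_cols = ["id", "name"]
num_cols = ["price", "qty"]
ordered = id_cols + num_cols

# Make sure every column is covered before reordering.
if set(ordered) != set(mydf.columns):
    raise ValueError("column groups do not cover the DataFrame")

mydf = mydf[ordered]
```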
02. Basic Calculation
Apply basic calculations to all columns
# NaN is skipped by default
mydf.count()
mydf.min()
mydf.max()
mydf.sum()
mydf.cumsum()
# propagate NaN: the result is NaN for any column that contains NaN
mydf.sum(skipna=False)
# ex1) check nulls
mydf.isnull().sum()  # number of nulls per column
mydf.isnull().sum().sum()  # total number of nulls in the DataFrame
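The difference between the default and `skipna=False` shows up on a frame with one missing value. The data below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "a" has one NaN, column "b" has none.
mydf = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

counts = mydf.count()                 # non-null values per column
default_sum = mydf.sum()              # NaN skipped
strict_sum = mydf.sum(skipna=False)   # NaN propagates into the result

nulls_per_col = mydf.isnull().sum()
total_nulls = int(mydf.isnull().sum().sum())
```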
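# basic calculation
mydf + 2019  # works only when every column is numeric
mydf == value  # element-wise comparison, returns a boolean DataFrame
mydf1 == mydf2  # NaN == NaN is False, so frames containing NaN never compare as fully equal
mydf.equals(mydf2)  # True when both frames match; NaN in the same position counts as equal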
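A quick check of why `==` is unreliable for frames that contain NaN, using a hypothetical one-column frame compared against its own copy:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [1.0, np.nan]})
df2 = df1.copy()

# Element-wise comparison: the NaN cell compares as False.
elementwise = df1 == df2
all_equal = bool(elementwise.all().all())

# equals() treats NaN in the same position as equal.
same = df1.equals(df2)
```

Here `all_equal` is False even though the two frames are identical, while `same` is True.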
03. Memory Save
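# check memory usage per column (deep=True measures the actual size of object values)
mydf.memory_usage(deep=True)
# type change (needs `import numpy as np`; values must fit in int8, i.e. -128 to 127)
mydf['col1'] = mydf['col1'].astype(np.int8)
# to Categorical (worthwhile for object columns with few unique values)
mydf.select_dtypes(include=['object']).nunique()
mydf['col2'] = mydf['col2'].astype('category')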
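The effect can be measured by comparing `memory_usage` before and after the conversion. This is a sketch on made-up data. The range check before downcasting is an added precaution, since casting out-of-range values to `int8` may wrap around or raise depending on the library version.

```python
import numpy as np
import pandas as pd

n = 10_000
# Hypothetical data: small integers and a low-cardinality string column.
mydf = pd.DataFrame({
    "col1": np.arange(n) % 100,
    "col2": pd.Series(["red", "green", "blue", "red"] * (n // 4), dtype=object),
})

before = mydf.memory_usage(deep=True).sum()

# Downcast only if every value fits in the int8 range.
limits = np.iinfo(np.int8)
if mydf["col1"].between(limits.min, limits.max).all():
    mydf["col1"] = mydf["col1"].astype(np.int8)

# Few unique values, so category is a good fit.
mydf["col2"] = mydf["col2"].astype("category")

after = mydf.memory_usage(deep=True).sum()
```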
04. Largest & Smallest
# take the 1000 rows with the largest col1, then the 10 of those with the smallest col2
top1000 = mydf.nlargest(1000, 'col1')
target = top1000.nsmallest(10, 'col2')
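The same chaining on a tiny frame, with smaller row counts so the result is easy to follow. The data and the `title` column are hypothetical.

```python
import pandas as pd

mydf = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e"],
    "col1": [50, 10, 40, 30, 20],
    "col2": [5, 1, 2, 9, 3],
})

# Rows with the 3 largest col1 values: a (50), c (40), d (30).
top3 = mydf.nlargest(3, "col1")

# Among those, the 2 rows with the smallest col2: c (2), then a (5).
target = top3.nsmallest(2, "col2")
```

Row `b` has the smallest `col2` overall but is excluded because it did not make the `col1` cut first.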