Speed up the process of importing multiple CSV files into a pandas DataFrame
I read multiple CSV files (hundreds of files, hundreds of lines each, all with the same number of columns) from a target directory into a single pandas DataFrame.
The code I wrote below works, but too slowly: it takes minutes to run on just 30 files (so how long would I have to wait for my full load of files?). What can I alter to make it faster?
Besides, in the replace function, I want to replace "_" (I don't know its encoding; it is not a normal one) with "-" (normal UTF-8). How can I do that? I use coding=latin-1 because I have French accents in the files.
# coding=latin-1
import pandas as pd
import glob

pd.set_option('expand_frame_repr', False)

path = r'd:\python27\mypfe\data_test'
allfiles = glob.glob(path + "/*.csv")

frame = pd.DataFrame()
list_ = []
for file_ in allfiles:
    df = pd.read_csv(file_, index_col=None, header=0, sep=';', dayfirst=True,
                     parse_dates=['heureprevue', 'heuredebuttrajet',
                                  'heurearriveesursite', 'heureeffective'])
    df.drop(labels=['apaye', 'methodepaiement', 'argentpercu'], axis=1, inplace=True)
    df['sens'].replace("\n", "-", inplace=True, regex=True)
    list_.append(df)
    print "fichier lu:", file_
frame = pd.concat(list_)
print frame
You may try the following: read only the columns you need, use a list comprehension, and call pd.concat([...], ignore_index=True) just once, because calling it repeatedly is pretty slow:
# There is no sense in reading columns you don't need,
# so specify the column list explicitly
# (excluding: 'apaye', 'methodepaiement', 'argentpercu')
cols = ['col1', 'col2', 'etc.']
date_cols = ['heureprevue', 'heuredebuttrajet', 'heurearriveesursite', 'heureeffective']

df = pd.concat(
    [pd.read_csv(f, sep=';', dayfirst=True, usecols=cols, parse_dates=date_cols)
     for f in allfiles],
    ignore_index=True
)
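A minimal, self-contained sketch of the same pattern, using synthetic in-memory "files" so it can be tried without a directory of CSVs (the column names and sample data here are made up, not the asker's real ones):

```python
import io
import pandas as pd

# Three identical in-memory CSVs stand in for the glob results
csv_text = "a;b;heureprevue\n1;x;01/02/2016 10:00\n2;y;01/02/2016 11:00\n"
files = [io.StringIO(csv_text) for _ in range(3)]

cols = ['a', 'heureprevue']   # read only the columns that are needed
date_cols = ['heureprevue']

df = pd.concat(
    [pd.read_csv(f, sep=';', dayfirst=True, usecols=cols, parse_dates=date_cols)
     for f in files],
    ignore_index=True
)
print(len(df))  # 6 rows: 3 "files" x 2 data lines each
```

Note that usecols skips column 'b' entirely at parse time, and ignore_index=True rebuilds a clean 0..n-1 index across all files.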
This should work, provided that you have enough memory to store the two resulting DataFrames...
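As for the replace question: since the stray character's identity is unknown, one approach is to inspect repr(df['sens'].iloc[0]) to see its escape code, then replace it. A sketch, assuming for illustration that the character is '\x96' (a Windows-1252 en dash that survives a latin-1 read; the real byte may differ):

```python
import pandas as pd

# Hypothetical data containing the assumed stray character '\x96'
df = pd.DataFrame({'sens': [u'nord\x96sud', u'est\x96ouest']})

# Replace it with a plain ASCII hyphen; regex=False treats it literally
df['sens'] = df['sens'].str.replace(u'\x96', '-', regex=False)
print(df['sens'].tolist())  # ['nord-sud', 'est-ouest']
```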