python - Creating new matrix from dataframe and matrix in pandas
I have a dataframe df that looks like this:

      id1 id2  weights
    0   a  2a  144.000
    1   a  2b   52.500
    2   a  2c    2.000
    3   a  2d    1.000
    4   a  2e    1.000
    5   b  2a    2.000
    6   b  2e    1.000
    7   b  2f    1.000
    8   b  2b    1.000
    9   b  2c    0.008
and a similarity matrix mat between the elements of the id2 column:

         2a   2b   2c   2d   2e   2f
    2a  1.0  0.5  0.7  0.2  0.1  0.3
    2b  0.5  1.0  0.6  0.4  0.3  0.4
    2c  0.7  0.6  1.0  0.1  0.4  0.2
    2d  0.2  0.4  0.1  1.0  0.8  0.7
    2e  0.1  0.3  0.4  0.8  1.0  0.8
    2f  0.3  0.4  0.2  0.7  0.8  1.0
Now I would like to create a similarity matrix between the elements of id1 and the elements of id2. To do that, I consider the elements of id1 as barycentres of their corresponding elements of id2 in my dataframe df (with the corresponding weights).
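To pin down what "barycentre" means here, the formula below is one way to write it. The symbols s, w and the index names are notation introduced for this sketch, not something from the original post: w_{ij} is the weight of id2 element j within id1 group i, and mat(j, k) is the given similarity between id2 elements j and k.

```latex
s(i, k) \;=\; \frac{\sum_{j} w_{ij}\,\mathrm{mat}(j, k)}{\sum_{j} w_{ij}}

% Worked entry for the first id1 group (rows 0 to 4, weights summing to 200.5) and k = 2a:
s(\cdot, \mathrm{2a}) \;=\;
\frac{144 \cdot 1 + 52.5 \cdot 0.5 + 2 \cdot 0.7 + 1 \cdot 0.2 + 1 \cdot 0.1}{200.5}
\;=\; \frac{171.95}{200.5} \;\approx\; 0.8576
```

This agrees with the first entry of the desired output shown in the question.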
My first attempt does it with loops (ouch):

    ids = df.id1.unique()
    output = pd.DataFrame(columns=mat.columns, index=ids)
    for id in ids:
        # rows of df belonging to the current id1
        df_slice = df.loc[df.id1 == id]
        to_normalize = df_slice.weights.sum()
        # similarity rows of the id2 elements in this group
        temp = mat.loc[df_slice.id2]
        for art in df_slice.id2:
            # .loc replaces the original .ix, which was removed in pandas 1.0
            temp.loc[art] *= df_slice.loc[df_slice.id2 == art, 'weights'].values[0]
            temp.loc[art] /= (1. * to_normalize)
        output.loc[id] = temp.sum()
But of course this is not Pythonic, and it takes ages (timeit on these small matrices showed 21.3 ms), so it is not computable for a 10k-row df and a 3k-by-3k mat. Is there a cleaner and more efficient way to do it?
Desired output:

             2a        2b        2c        2d        2e        2f
    a  0.857606  0.630424  0.672319  0.258354  0.163342  0.329676
    b  0.580192  0.540096  0.520767  0.459425  0.459904  0.559425
And is there a way to then compute another similarity matrix between the elements of id1 (from this data)?

Thank you in advance.
The following clocks in at 6–7 ms (vs. around 30 ms that your approach takes on my machine).
    import io
    import pandas as pd

    raw_df = io.StringIO("""\
      id1 id2  weights
    0   a  2a  144.0
    1   a  2b   52.5
    2   a  2c    2.0
    3   a  2d    1.0
    4   a  2e    1.0
    5   b  2a    2.0
    6   b  2e    1.0
    7   b  2f    1.0
    8   b  2b    1.0
    9   b  2c    0.008
    """)
    # sep=r'\s+' is equivalent to delim_whitespace=True, which newer pandas deprecates
    df = pd.read_csv(raw_df, sep=r'\s+')

    raw_mat = io.StringIO("""\
        2a   2b   2c   2d   2e   2f
    2a   1  0.5  0.7  0.2  0.1  0.3
    2b 0.5    1  0.6  0.4  0.3  0.4
    2c 0.7  0.6    1  0.1  0.4  0.2
    2d 0.2  0.4  0.1    1  0.8  0.7
    2e 0.1  0.3  0.4  0.8    1  0.8
    2f 0.3  0.4  0.2  0.7  0.8    1
    """)
    mat = pd.read_csv(raw_mat, sep=r'\s+')

    # total weight of each id1 group, repeated on every row of that group
    df['norm'] = df.groupby('id1')['weights'].transform('sum')

    # attach to each (id1, id2) row the similarity row of its id2
    m = pd.merge(df, mat, left_on='id2', right_index=True)
    # scale each similarity row by its normalised weight
    m[mat.index] = m[mat.index].multiply(m['weights'] / m['norm'], axis=0)

    # summing the scaled rows per id1 gives the weighted barycentre
    output = m.groupby('id1')[mat.index].sum()
    output.columns.name = 'id2'
    print(output)
Output:

    id2        2a        2b        2c        2d        2e        2f
    id1
    a    0.857606  0.630424  0.672319  0.258354  0.163342  0.329676
    b    0.580192  0.540096  0.520767  0.459425  0.459904  0.559425
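The same weighted average can also be written as a single matrix product, which may scale better to a 10k-row df and a 3k-by-3k mat. This is a minimal sketch rather than part of the answer above. It assumes the first id1 label is "a", that every id2 in df appears in mat, and that a dense id1-by-id2 weight table fits in memory. The name `w` is introduced here for illustration only.

```python
import pandas as pd

# Toy data copied from the question; the first id1 label is assumed to be "a".
df = pd.DataFrame({
    "id1": ["a"] * 5 + ["b"] * 5,
    "id2": ["2a", "2b", "2c", "2d", "2e", "2a", "2e", "2f", "2b", "2c"],
    "weights": [144.0, 52.5, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 0.008],
})
labels = ["2a", "2b", "2c", "2d", "2e", "2f"]
mat = pd.DataFrame(
    [[1.0, 0.5, 0.7, 0.2, 0.1, 0.3],
     [0.5, 1.0, 0.6, 0.4, 0.3, 0.4],
     [0.7, 0.6, 1.0, 0.1, 0.4, 0.2],
     [0.2, 0.4, 0.1, 1.0, 0.8, 0.7],
     [0.1, 0.3, 0.4, 0.8, 1.0, 0.8],
     [0.3, 0.4, 0.2, 0.7, 0.8, 1.0]],
    index=labels, columns=labels,
)

# Weight table: one row per id1, one column per id2, absent pairs set to 0.
w = df.pivot_table(index="id1", columns="id2", values="weights",
                   aggfunc="sum", fill_value=0.0)
# Make sure the columns follow the row order of mat before multiplying.
w = w.reindex(columns=mat.index, fill_value=0.0)
# Normalise each row so the weights of one id1 sum to 1.
w = w.div(w.sum(axis=1), axis=0)

# (id1 x id2) times (id2 x id2) gives the id1-by-id2 barycentre similarities.
output = w.dot(mat)
print(output.round(6))
```

On the toy data this should reproduce the table printed above. For the follow-up question about id1-to-id1 similarity, one possible extension would be `output.dot(w.T)`, but that presumes similarity should extend bilinearly over the weights, which is an assumption and not something the question or the answer settles.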