python - Creating new matrix from dataframe and matrix in pandas


I have a dataframe df that looks like this:

      id1  id2  weights
    0   a   2a  144.0
    1   a   2b  52.5
    2   a   2c  2.0
    3   a   2d  1.0
    4   a   2e  1.0
    5   b   2a  2.0
    6   b   2e  1.0
    7   b   2f  1.0
    8   b   2b  1.0
    9   b   2c  0.008

And a similarity matrix mat between the elements of the id2 column:

        2a    2b   2c   2d   2e   2f
    2a  1     0.5  0.7  0.2  0.1  0.3
    2b  0.5   1    0.6  0.4  0.3  0.4
    2c  0.7   0.6  1    0.1  0.4  0.2
    2d  0.2   0.4  0.1  1    0.8  0.7
    2e  0.1   0.3  0.4  0.8  1    0.8
    2f  0.3   0.4  0.2  0.7  0.8  1

Now I would like to create a similarity matrix between the elements of id1 and the elements of id2. For that, I consider the elements of id1 as the barycentres of their corresponding elements of id2 in my dataframe df (with the corresponding weights).
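Spelled out, the barycentre idea amounts to a weighted average of rows of mat. The notation here is my own and not from the original post: i is an id1 value, j runs over the id2 values listed for i in df, k is a column of mat, and w_{ij} is the corresponding entry of weights.

```latex
\mathrm{output}[i, k]
  = \frac{\sum_{j} w_{ij}\,\mathrm{mat}[j, k]}{\sum_{j} w_{ij}}
```

For the first id1 group and column 2a, this gives (144·1 + 52.5·0.5 + 2·0.7 + 1·0.2 + 1·0.1) / 200.5 = 171.95 / 200.5 ≈ 0.857606.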

My first attempt uses loops (ouch):

    ids = df.id1.unique()
    output = pd.DataFrame(columns=mat.columns, index=ids)
    for id in ids:
        # rows of df belonging to this id1, and their total weight
        df_slice = df.loc[df.id1 == id]
        to_normalize = df_slice.weights.sum()
        # rows of mat for the id2 elements of this id1
        temp = mat.loc[df_slice.id2]
        for art in df_slice.id2:
            # .loc here; older pandas versions also accepted .ix
            temp.loc[art] *= df_slice.loc[df_slice.id2 == art, 'weights'].values[0]
            temp.loc[art] /= (1. * to_normalize)
        output.loc[id] = temp.sum()

But of course this way is not pythonic, and it takes ages (timeit on these small matrices showed 21.3 ms, which is not computable for a 10k-row df and a 3k by 3k mat). Is there a cleaner/more efficient way to do it?

Desired output:

        2a          2b          2c          2d          2e          2f
    a   0.857606    0.630424    0.672319    0.258354    0.163342    0.329676
    b   0.580192    0.540096    0.520767    0.459425    0.459904    0.559425

And is there a way to then compute the similarity matrix between the elements of id1 (from this data)?

Thank you in advance.

The following clocks in at 6–7 ms (vs. around 30 ms that your approach takes on my machine).

    import io

    import pandas as pd

    raw_df = io.StringIO("""\
      id1  id2  weights
    0   a   2a   144.0
    1   a   2b   52.5
    2   a   2c   2.0
    3   a   2d   1.0
    4   a   2e   1.0
    5   b   2a   2.0
    6   b   2e   1.0
    7   b   2f   1.0
    8   b   2b   1.0
    9   b   2c   0.008
    """)
    # sep=r'\s+' splits on runs of whitespace (same effect as delim_whitespace=True)
    df = pd.read_csv(raw_df, sep=r'\s+')

    raw_mat = io.StringIO("""\
        2a    2b   2c   2d   2e   2f
    2a  1     0.5  0.7  0.2  0.1  0.3
    2b  0.5   1    0.6  0.4  0.3  0.4
    2c  0.7   0.6  1    0.1  0.4  0.2
    2d  0.2   0.4  0.1  1    0.8  0.7
    2e  0.1   0.3  0.4  0.8  1    0.8
    2f  0.3   0.4  0.2  0.7  0.8  1
    """)
    mat = pd.read_csv(raw_mat, sep=r'\s+')

    # total weight per id1, broadcast back onto every row
    df['norm'] = df.groupby('id1')['weights'].transform('sum')

    # attach the matching row of mat to each (id1, id2) pair
    m = pd.merge(df, mat, left_on='id2', right_index=True)
    # scale each similarity row by its normalized weight
    m[mat.index] = m[mat.index].multiply(m['weights'] / m['norm'], axis=0)

    # sum the weighted rows within each id1
    output = m.groupby('id1')[mat.index].sum()
    output.columns.name = 'id2'
    print(output)

Output:

    id2        2a        2b        2c        2d        2e        2f
    id1
    a    0.857606  0.630424  0.672319  0.258354  0.163342  0.329676
    b    0.580192  0.540096  0.520767  0.459425  0.459904  0.559425
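The same weighted average can also be written as a single matrix product, which may be worth trying on larger inputs. The sketch below is not part of the original answer. It assumes the first id1 group is labelled "a", and the names W and sim_id1 are illustrative. The last line is only one possible convention for the id1-to-id1 question (extending the barycentre idea to both sides), and its diagonal is generally not 1.

```python
import pandas as pd

# Data from the question (the label "a" for the first group is an assumption).
df = pd.DataFrame({
    "id1": ["a"] * 5 + ["b"] * 5,
    "id2": ["2a", "2b", "2c", "2d", "2e", "2a", "2e", "2f", "2b", "2c"],
    "weights": [144.0, 52.5, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 0.008],
})
labels = ["2a", "2b", "2c", "2d", "2e", "2f"]
mat = pd.DataFrame(
    [[1.0, 0.5, 0.7, 0.2, 0.1, 0.3],
     [0.5, 1.0, 0.6, 0.4, 0.3, 0.4],
     [0.7, 0.6, 1.0, 0.1, 0.4, 0.2],
     [0.2, 0.4, 0.1, 1.0, 0.8, 0.7],
     [0.1, 0.3, 0.4, 0.8, 1.0, 0.8],
     [0.3, 0.4, 0.2, 0.7, 0.8, 1.0]],
    index=labels, columns=labels,
)

# Weight matrix W: one row per id1, one column per id2, 0 where a pair is absent.
W = df.pivot_table(index="id1", columns="id2", values="weights",
                   aggfunc="sum", fill_value=0.0)
# Make sure W's columns match mat's rows, including id2 values unused in df.
W = W.reindex(columns=mat.index, fill_value=0.0)
# Normalize each row so the weights of every id1 sum to 1.
W = W.div(W.sum(axis=1), axis=0)

# id1-to-id2 similarity: weighted average of the rows of mat.
output = W.dot(mat)

# Assumed convention for id1-to-id1 similarity: average over both sides.
sim_id1 = output.dot(W.T)

print(output.round(6))
```

For a 3k by 3k mat, W is mostly zeros, so a sparse representation (for example with scipy.sparse) may be preferable to the dense pivot shown here.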
