Grouping a pandas DataFrame with predefined groups

I am wondering how to efficiently do something like a groupby when I have predefined groups and an element may belong to several groups at the same time.

Suppose I have the following DataFrame:

import pandas as pd

df = pd.DataFrame({'value': [0, 2, 4]}, index=['A', 'B', 'C'])

   value
A      0
B      2
C      4

and I have the following predefined groups, which may overlap and differ in size:

groups = {'group 1': ['A', 'B'],
          'group 2': ['A', 'B', 'C']}

Now I want to apply a function to each of these groups of rows. For example, I want to calculate the mean of value for each group.

I was thinking of creating an intermediate "expanded" DataFrame on which I could run a groupby:

intermediate_df = pd.DataFrame(columns=['id', 'group', 'value'])
intermediate_df['value'] = intermediate_df['value'].astype(float)

# Append one row per (group, member) pair.
# Note: DataFrame.append was removed in pandas 2.0, so this loop only runs on older versions.
for group, members in groups.items():
    for id_ in members:
        row = pd.Series([id_, group, df.at[id_, 'value']],
                        index=['id', 'group', 'value'])
        intermediate_df = intermediate_df.append(row, ignore_index=True)

  id    group  value
0  A  group 1    0.0
1  B  group 1    2.0
2  A  group 2    0.0
3  B  group 2    2.0
4  C  group 2    4.0

Then I could do

intermediate_df.groupby('group')[['value']].mean()

which would give me the desired result:

         value
group
group 1    1.0
group 2    2.0

Of course, building the intermediate DataFrame row by row like this is very inefficient. What would be an efficient solution to my problem?

edited Mar 8 at 14:10

asked Mar 7 at 18:35

Fabian Rost

python pandas dataframe pandas-groupby

3 Answers

You can create your intermediate_df with pd.concat and a list comprehension:

intermediate_df = pd.concat([df.loc[v].assign(group=k) for k, v in groups.items()])

[OUT]

   value    group
A      0  group 1
B      2  group 1
A      0  group 2
B      2  group 2
C      4  group 2
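For completeness, here is a minimal end-to-end sketch of the concat approach on the toy data from the question, followed by the groupby step. The variable name `summary` and the choice of aggregations (`mean`, `max`, `count`) are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'value': [0, 2, 4]}, index=['A', 'B', 'C'])
groups = {'group 1': ['A', 'B'], 'group 2': ['A', 'B', 'C']}

# Select each group's rows and tag them with the group name,
# then stack the tagged pieces into one long frame.
intermediate_df = pd.concat(
    [df.loc[members].assign(group=name) for name, members in groups.items()]
)

# Any groupby aggregation can now be applied per predefined group.
summary = intermediate_df.groupby('group')['value'].agg(['mean', 'max', 'count'])
print(summary)
```

Rows that belong to several groups simply appear once per group in intermediate_df, which is what makes an ordinary groupby work here.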

answered Mar 7 at 18:57

Chris A

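Edit: for groups of unequal size, try:

# .groupby(level=1).mean() replaces the original .mean(level=1), which was removed in pandas 2.0
pd.DataFrame.from_dict(groups, orient='index').T.stack().map(df.squeeze()).groupby(level=1).mean()

If all groups have the same size, this shorter form also works:

pd.DataFrame(groups).stack().map(df.squeeze()).groupby(level=1).mean()

Output:

group 1    1.0
group 2    2.0
dtype: float64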
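The same "map each member to its value" idea can also be sketched with `Series.explode` (available since pandas 0.25). This is a swapped-in variant rather than the answer's own code, and the names `members`, `values` and `group_means` are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'value': [0, 2, 4]}, index=['A', 'B', 'C'])
groups = {'group 1': ['A', 'B'], 'group 2': ['A', 'B', 'C']}

# One entry per (group, member) pair; the index holds the group name.
# Lists of different lengths are fine because each list is one cell.
members = pd.Series(groups).explode()

# Look up each member's value in df, then average per group name.
values = members.map(df['value']).astype(float)
group_means = values.groupby(level=0).mean()
print(group_means)
```

Because each group's member list is stored as a single cell before exploding, this form does not need the groups to have equal size.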

edited Mar 8 at 20:39

answered Mar 7 at 19:06

Scott Boston

Thanks! Unfortunately, I forgot to specify that the groups might be of different sizes. In that case your answer no longer works, but that was my fault. I have updated the question.

– Fabian Rost
Mar 8 at 14:11

@FabianRost see update.

– Scott Boston
Mar 8 at 20:39

Building on the previous answers, I build intermediate_df from a list comprehension of (group, id) pairs and a merge:

intermediate_df = pd.DataFrame([[group, id_] for group, members in groups.items() for id_ in members],
                               columns=['group', 'id']).merge(df, left_on='id', right_index=True)
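One detail of the merge step is worth a small sketch: with the default inner join, a group member that does not exist in df is silently dropped. The example below assumes a hypothetical extra member 'D' and uses `how='left'` to keep such pairs visible. The names `pairs`, `expanded` and `unknown` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'value': [0, 2, 4]}, index=['A', 'B', 'C'])
# 'D' is a hypothetical member that does not exist in df.
groups = {'group 1': ['A', 'B'], 'group 2': ['A', 'B', 'C', 'D']}

# Flatten the dict into one row per (group, id) pair.
pairs = pd.DataFrame(
    [(name, id_) for name, members in groups.items() for id_ in members],
    columns=['group', 'id'],
)

# how='left' keeps every pair; ids missing from df get NaN in 'value'.
expanded = pairs.merge(df, left_on='id', right_index=True, how='left')
unknown = expanded.loc[expanded['value'].isna(), ['group', 'id']]

# mean() skips NaN by default, so unknown ids do not change the means.
group_means = expanded.groupby('group')['value'].mean()
print(unknown)
print(group_means)
```

Selecting the 'value' column before calling mean() also avoids trying to average the string column 'id'.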

In my benchmark, this merge-based approach is the fastest of the three:

import numpy as np
import pandas as pd

n = 10000  # number of rows in df
m = 1000   # number of groups, each with 5 randomly sampled members
df = pd.DataFrame({'value': np.random.normal(size=n)}, index=np.arange(n).astype(str))
groups = {str(i): list(df.sample(5).index) for i in range(m)}

%%timeit
intermediate_df = pd.concat([df.loc[members].assign(group=group) for group, members in groups.items()])
intermediate_df.groupby('group').mean()

948 ms ± 63.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
# timed as written; pandas >= 2.0 needs .groupby(level=1).mean() instead of .mean(level=1)
pd.DataFrame(groups).stack().map(df.squeeze()).mean(level=1)

42.4 ms ± 183 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
# timed as written; pandas >= 2.0 needs the numeric column selected first, e.g. .groupby('group')[['value']].mean()
intermediate_df = pd.DataFrame([[group, id_] for group, members in groups.items() for id_ in members],
                               columns=['group', 'id']).merge(df, left_on='id', right_index=True)
intermediate_df.groupby('group').mean()

6.13 ms ± 50.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
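Outside a notebook, a comparison like the one above can be sketched with the standard-library `timeit` module. The function names `via_concat` and `via_merge`, the fixed seed, and the reduced problem sizes are all assumptions chosen so that the script runs quickly. The timings it prints will not match the figures quoted above.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed (assumption) for repeatable groups
n, m, k = 2_000, 200, 5         # rows, groups, members per group (reduced sizes)

df = pd.DataFrame({'value': rng.normal(size=n)}, index=np.arange(n).astype(str))
ids = df.index.to_numpy()
groups = {str(i): rng.choice(ids, size=k, replace=False).tolist() for i in range(m)}


def via_concat(df, groups):
    """Tag each group's rows and concatenate them (first answer)."""
    expanded = pd.concat(
        [df.loc[members].assign(group=name) for name, members in groups.items()]
    )
    return expanded.groupby('group')['value'].mean()


def via_merge(df, groups):
    """Build (group, id) pairs and merge them with df (third answer)."""
    pairs = pd.DataFrame(
        [(name, id_) for name, members in groups.items() for id_ in members],
        columns=['group', 'id'],
    )
    expanded = pairs.merge(df, left_on='id', right_index=True)
    return expanded.groupby('group')['value'].mean()


# Both approaches should produce the same per-group means.
pd.testing.assert_series_equal(via_concat(df, groups), via_merge(df, groups))

for func in (via_concat, via_merge):
    best = min(timeit.repeat(lambda: func(df, groups), number=1, repeat=3))
    print(f'{func.__name__}: {best * 1e3:.1f} ms')
```

Checking that the approaches agree before timing them guards against comparing functions that compute different things.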

answered Mar 8 at 14:47

Fabian Rost
