grouping a pandas DataFrame with predefined groups
I am wondering how to efficiently do something like a groupby when I have predefined groups and elements may belong to several groups at the same time.
Suppose I have the following DataFrame:

    import pandas as pd

    df = pd.DataFrame({'value': [0, 2, 4]}, index=['A', 'B', 'C'])

       value
    A      0
    B      2
    C      4
and I have the following predefined groups, which may overlap and differ in size:

    groups = {'group 1': ['A', 'B'],
              'group 2': ['A', 'B', 'C']}
Now I want to apply a function to each of these groups. For example, I want to calculate the mean of value for each group.
I was thinking of creating an intermediate "expanded" DataFrame on which I could run a groupby:

    intermediate_df = pd.DataFrame(columns=['id', 'group', 'value'])
    intermediate_df['value'] = intermediate_df['value'].astype(float)
    for group, members in groups.items():
        for id_ in members:
            row = pd.Series([id_, group, df.at[id_, 'value']],
                            index=['id', 'group', 'value'])
            # DataFrame.append requires pandas < 2.0
            intermediate_df = intermediate_df.append(row, ignore_index=True)

      id    group  value
    0  A  group 1    0.0
    1  B  group 1    2.0
    2  A  group 2    0.0
    3  B  group 2    2.0
    4  C  group 2    4.0
Then, I could do

    intermediate_df.groupby('group').mean()

which would give me the desired result:

             value
    group
    group 1    1.0
    group 2    2.0
Of course, the way I create this intermediate DataFrame is absolutely inefficient. What would be an efficient solution for my problem?
python pandas dataframe pandas-groupby
edited Mar 8 at 14:10
Fabian Rost
asked Mar 7 at 18:35
Fabian Rost
3 Answers
You can create your intermediate_df with pd.concat and a list comprehension:

    intermediate_df = pd.concat([df.loc[v].assign(group=k) for k, v in groups.items()])

[OUT]

       value    group
    A      0  group 1
    B      2  group 1
    A      0  group 2
    B      2  group 2
    C      4  group 2
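
To make the full round trip explicit, here is a minimal sketch that reuses the question's df and groups, builds the long frame with pd.concat, and then takes the per-group mean. The names group_means, name and members are illustrative choices, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'value': [0, 2, 4]}, index=['A', 'B', 'C'])
groups = {'group 1': ['A', 'B'], 'group 2': ['A', 'B', 'C']}

# One labelled slice of df per group, stacked into a single long frame.
# Rows that belong to several groups simply appear several times.
intermediate_df = pd.concat(
    [df.loc[members].assign(group=name) for name, members in groups.items()]
)

# Selecting 'value' explicitly restricts the aggregation to the numeric column.
group_means = intermediate_df.groupby('group')['value'].mean()
print(group_means)
```

Because every group triggers its own df.loc lookup and a small copy, this version does more work per group than approaches that build the (group, id) pairs first.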
answered Mar 7 at 18:57
Chris A
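Edit, to handle groups of uneven size:

    pd.DataFrame.from_dict(groups, orient='index').T.stack().map(df.squeeze()).mean(level=1)

When all groups have the same size, you can also do it this way:

    pd.DataFrame(groups).stack().map(df.squeeze()).mean(level=1)

The level argument of mean was removed in pandas 2.0; on newer versions, replace .mean(level=1) with .groupby(level=1).mean().

Output:

    group 1    1
    group 2    2
    dtype: int64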
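
A related formulation, offered as a sketch and not as the answer author's method: Series.explode (available since pandas 0.25) turns the dict of lists directly into one row per (group, member) pair, so groups of different sizes need no padding. The names members and group_means are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'value': [0, 2, 4]}, index=['A', 'B', 'C'])
groups = {'group 1': ['A', 'B'], 'group 2': ['A', 'B', 'C']}

# One row per (group, member) pair: the index holds the group name,
# the values hold the member ids. Lists of different lengths are fine.
members = pd.Series(groups).explode()

# Look up each member's value in df, then average per group label.
group_means = members.map(df['value']).groupby(level=0).mean()
print(group_means)
```

Here groupby(level=0) groups on the index, which after explode carries the group names.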
edited Mar 8 at 20:39
answered Mar 7 at 19:06
Scott Boston
Thanks! Unfortunately, I forgot to specify that the groups might be of different sizes, so your answer no longer works; that was my fault. I have updated the question.
– Fabian Rost
Mar 8 at 14:11
@FabianRost see update.
– Scott Boston
Mar 8 at 20:39
Building on the previous answers, I use a list comprehension to build the intermediate_df:

    intermediate_df = pd.DataFrame([[group, id_] for group, members in groups.items() for id_ in members],
                                   columns=['group', 'id']).merge(df, left_on='id', right_index=True)

This seems to be the fastest of the solutions posted here:

    import numpy as np

    n = 10000
    m = 1000
    df = pd.DataFrame({'value': np.random.normal(size=n)}, index=np.arange(n).astype(str))
    groups = {str(i): list(df.sample(5).index) for i in range(m)}

    %%timeit
    intermediate_df = pd.concat([df.loc[members].assign(group=group) for group, members in groups.items()])
    intermediate_df.groupby('group').mean()

    948 ms ± 63.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    %%timeit
    pd.DataFrame(groups).stack().map(df.squeeze()).mean(level=1)

    42.4 ms ± 183 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %%timeit
    intermediate_df = pd.DataFrame([[group, id_] for group, members in groups.items() for id_ in members],
                                   columns=['group', 'id']).merge(df, left_on='id', right_index=True)
    intermediate_df.groupby('group').mean()

    6.13 ms ± 50.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
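
For reference, a self-contained sketch of the merge approach on the small example from the question. It assumes a recent pandas (2.0 or later), where groupby(...).mean() no longer silently drops the string 'id' column, so the 'value' column is selected explicitly; pairs and group_means are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'value': [0, 2, 4]}, index=['A', 'B', 'C'])
groups = {'group 1': ['A', 'B'], 'group 2': ['A', 'B', 'C']}

# Flatten the dict into one (group, id) row per membership.
pairs = pd.DataFrame(
    [(group, id_) for group, members in groups.items() for id_ in members],
    columns=['group', 'id'],
)

# Attach each member's value by joining the 'id' column against df's index.
intermediate_df = pairs.merge(df, left_on='id', right_index=True)

# Aggregate only the numeric column, so the string 'id' column is not averaged.
group_means = intermediate_df.groupby('group')['value'].mean()
print(group_means)
```

Building all pairs in plain Python and joining once keeps the number of pandas operations constant regardless of how many groups there are, which is a plausible reason for the timing difference reported above.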
answered Mar 8 at 14:47
Fabian Rost