How to get datediff() in seconds in pyspark?
I have tried the code from this post, but I cannot get the date difference in seconds. I just take datediff() between the columns 'Attributes_Timestamp_fix' and 'lagged_date' below. Any hints?
Below are my code and output.
from pyspark.sql.functions import datediff, lag
from pyspark.sql.window import Window

eg = eg.withColumn("lagged_date",
                   lag(eg.Attributes_Timestamp_fix, 1)
                       .over(Window.partitionBy("id")
                                   .orderBy("Attributes_Timestamp_fix")))
eg = eg.withColumn("time_diff",
                   datediff(eg.Attributes_Timestamp_fix, eg.lagged_date))
             id Attributes_Timestamp_fix  time_diff
0  3.531611e+14      2018-04-01 00:01:02        NaN
1  3.531611e+14      2018-04-01 00:01:02        0.0
2  3.531611e+14      2018-04-01 00:03:13        0.0
3  3.531611e+14      2018-04-01 00:03:13        0.0
4  3.531611e+14      2018-04-01 00:03:13        0.0
5  3.531611e+14      2018-04-01 00:03:13        0.0
python apache-spark pyspark datediff
asked Mar 8 at 23:42 by a_geo · edited Mar 10 at 17:52
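Why the question's time_diff column comes out as all zeros: datediff() counts calendar-day boundaries crossed, not elapsed time, so timestamps falling on the same day differ by 0 days, and the NaN in the first row is just lag() returning null where the window has no preceding row. A minimal sketch of that behaviour, assuming an active spark session:

from pyspark.sql.functions import datediff, to_timestamp

# Two timestamps about two minutes apart, but on the same calendar day:
# datediff() reports 0 because no day boundary is crossed.
df = spark.createDataFrame(
    [("2018-04-01 00:01:02", "2018-04-01 00:03:13")], ["a", "b"])
df.select(datediff(to_timestamp("b"), to_timestamp("a")).alias("days")).show()
# +----+
# |days|
# +----+
# |   0|
# +----+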
1 Answer
In pyspark.sql.functions there is a datediff function, but unfortunately it only computes differences in days. To get around this, you can convert both dates to unix timestamps (in seconds) and compute the difference.
Let's create some sample data, compute the lag, and then the difference in seconds.
from pyspark.sql.functions import col, lag, unix_timestamp
from pyspark.sql.window import Window
import datetime

d = [{'id': 1, 't': datetime.datetime(2018, 1, 1)},
     {'id': 1, 't': datetime.datetime(2018, 1, 2)},
     {'id': 1, 't': datetime.datetime(2018, 1, 4)},
     {'id': 1, 't': datetime.datetime(2018, 1, 7)}]
df = spark.createDataFrame(d)
df.show()
+---+-------------------+
| id| t|
+---+-------------------+
| 1|2018-01-01 00:00:00|
| 1|2018-01-02 00:00:00|
| 1|2018-01-04 00:00:00|
| 1|2018-01-07 00:00:00|
+---+-------------------+
w = Window.partitionBy('id').orderBy('t')
df.withColumn("previous_t", lag(df.t, 1).over(w)) \
  .select(df.t, (unix_timestamp(df.t) - unix_timestamp(col('previous_t'))).alias('diff')) \
  .show()
+-------------------+------+
| t| diff|
+-------------------+------+
|2018-01-01 00:00:00| null|
|2018-01-02 00:00:00| 86400|
|2018-01-04 00:00:00|172800|
|2018-01-07 00:00:00|259200|
+-------------------+------+
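Applied back to the question's DataFrame, the same pattern would look like the sketch below (assuming eg and its columns as shown in the question, with Attributes_Timestamp_fix already of timestamp type):

from pyspark.sql.functions import col, lag, unix_timestamp
from pyspark.sql.window import Window

# Same window as in the question, but the difference is taken on epoch
# seconds, so time_diff comes out in seconds instead of whole days.
w = Window.partitionBy("id").orderBy("Attributes_Timestamp_fix")
eg = (eg.withColumn("lagged_date",
                    lag(col("Attributes_Timestamp_fix"), 1).over(w))
        .withColumn("time_diff",
                    unix_timestamp(col("Attributes_Timestamp_fix"))
                    - unix_timestamp(col("lagged_date"))))

Casting the timestamps to long (e.g. col('t').cast('long')) yields the same epoch-second values and is a common alternative to unix_timestamp().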
answered Mar 10 at 18:39 by Oli