PySpark 2.2.0 : 'numpy.ndarray' object has no attribute 'indices'
Task
I'm calculating the size of the __indices__ within a __SparseVector__ using the Python API for Spark (PySpark).
Script
from pyspark.ml.feature import VectorAssembler

def score_clustering(dataframe):
    # Assemble every column except "documento" into a single vector column
    assembler = VectorAssembler(inputCols=dataframe.drop("documento").columns, outputCol="variables")
    data_transformed = assembler.transform(dataframe)
    data_transformed_rdd = data_transformed.select("documento", "variables").orderBy(data_transformed.documento.asc()).rdd
    # the next line is where the AttributeError below is raised
    count_variables = data_transformed_rdd.map(lambda row: [row[0], row[1].indices.size]).toDF(["id", "frequency"])
Issue
When I execute the action __.count()__ on the __count_variables__ dataframe, an error shows up:
AttributeError: 'numpy.ndarray' object has no attribute 'indices'
The main part to consider is:
data_transformed_rdd.map(lambda row : [row[0], row[1].indices.size]).toDF(["id", "frequency"])
I believe this chunk has to do with the error, but I cannot understand why the exception talks about __numpy.ndarray__ when I'm doing the calculation by mapping that __lambda expression__, which takes as its argument a __SparseVector__ (created with the __assembler__).
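One quick way to test that assumption (a small debugging sketch, not part of the original script) is to look at the runtime type of the vector column before mapping:

# List the distinct vector types that actually reach the lambda; if anything
# other than SparseVector appears, .indices will fail on those rows.
data_transformed_rdd.map(lambda row: type(row[1]).__name__).distinct().collect()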
Any suggestions? Does anyone know what I'm doing wrong?
python pyspark
asked Mar 7 at 22:03 by David Arango Sampayo · edited Mar 9 at 13:26
1 Answer
There are two problems here. The first one is in the `indices.size` call: `indices` and `size` are two different attributes of the SparseVector class. `size` is the complete length of the vector, while `indices` holds the positions of the non-zero values; the count you want is the number of entries in `indices`, not the vector's `size`. So, assuming that all your vectors are instances of the SparseVector class:
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame([(0, Vectors.sparse(4, [0, 1], [11.0, 2.0])),
                            (1, Vectors.sparse(4, [], [])),
                            (3, Vectors.sparse(4, [0, 1, 2], [2.0, 2.0, 2.0]))],
                           ["documento", "variables"])
df.show()
+---------+--------------------+
|documento| variables|
+---------+--------------------+
| 0|(4,[0,1],[11.0,2.0])|
| 1| (4,[],[])|
| 3|(4,[0,1,2],[2.0,2...|
+---------+--------------------+
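To see the distinction on a single vector, here is a small illustrative snippet (not from the original answer):

from pyspark.ml.linalg import Vectors

v = Vectors.sparse(4, [0, 1], [11.0, 2.0])
print(v.size)          # 4     -- the total length of the vector
print(v.indices)       # [0 1] -- positions of the non-zero entries
print(len(v.indices))  # 2     -- the count the question is after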
The solution is the `len` function:
df = df.rdd.map(lambda x: (x[0], x[1], len(x[1].indices))) \
       .toDF(["documento", "variables", "frecuencia"])
df.show()
+---------+--------------------+----------+
|documento| variables|frecuencia|
+---------+--------------------+----------+
| 0|(4,[0,1],[11.0,2.0])| 2|
| 1| (4,[],[])| 0|
| 3|(4,[0,1,2],[2.0,2...| 3|
+---------+--------------------+----------+
And here comes the second problem: VectorAssembler does not always generate SparseVectors. Depending on which is more efficient, it can produce either a SparseVector or a DenseVector (based on the number of zeros your original vector has). For example, suppose the following DataFrame:
df = spark.createDataFrame([(0, Vectors.sparse(4, [0, 1], [11.0, 2.0])),
                            (1, Vectors.dense([1.0, 1.0, 1.0, 1.0])),
                            (3, Vectors.sparse(4, [0, 1, 2], [2.0, 2.0, 2.0]))],
                           ["documento", "variables"])
df.show()
+---------+--------------------+
|documento| variables|
+---------+--------------------+
| 0|(4,[0,1],[11.0,2.0])|
| 1| [1.0,1.0,1.0,1.0]|
| 3|(4,[0,1,2],[2.0,2...|
+---------+--------------------+
Document 1 is a DenseVector, and the previous solution does not work because DenseVectors do not have an `indices` attribute. (A DenseVector forwards unknown attribute lookups to its underlying numpy array, which is why the traceback complains about `numpy.ndarray` rather than DenseVector.) So you have to use a more general representation that works for a DataFrame containing both sparse and dense vectors, for example `numpy`:
import numpy as np

df = df.rdd.map(lambda x: (x[0],
                           x[1],
                           np.nonzero(x[1])[0].size)) \
       .toDF(["documento", "variables", "frecuencia"])
df.show()
+---------+--------------------+----------+
|documento| variables|frecuencia|
+---------+--------------------+----------+
| 0|(4,[0,1],[11.0,2.0])| 2|
| 1| [1.0,1.0,1.0,1.0]| 4|
| 3|(4,[0,1,2],[2.0,2...| 3|
+---------+--------------------+----------+
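Folding this back into the question's function, a minimal sketch (assuming the same column names as the original script; the int() cast is just a defensive choice for toDF) could look like:

import numpy as np
from pyspark.ml.feature import VectorAssembler

def score_clustering(dataframe):
    # Assemble every column except "documento" into a single vector column.
    assembler = VectorAssembler(inputCols=dataframe.drop("documento").columns,
                                outputCol="variables")
    data_transformed = assembler.transform(dataframe)
    # np.nonzero works for both SparseVector and DenseVector rows, so this
    # no longer depends on which representation VectorAssembler picked.
    return (data_transformed
            .select("documento", "variables")
            .orderBy(data_transformed.documento.asc())
            .rdd
            .map(lambda row: [row[0], int(np.nonzero(row[1])[0].size)])
            .toDF(["id", "frequency"]))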
answered Mar 14 at 19:33 by Amanda · edited Mar 14 at 20:50