ElementTree does not seem to get some texts/elements in the tree

I have a Wikipedia dump that I want to parse, and I am running into some puzzling problems with Python's XML parser, ElementTree.

My latest problem is that ElementTree does not seem to find text that is actually there. Here is some example data:

<page>
  <title>Cengiz Han</title>
  <ns>0</ns>
  <id>10</id>
  <revision>
    <id>20337884</id>
    <parentid>20218916</parentid>
    <timestamp>2019-01-29T14:02:43Z</timestamp>
    <contributor>
      <username>CommonsDelinker</username>
      <id>31545</id>
    </contributor>
    <comment>China_11b.jpg dosyası Map_of_China_1142.jpg ile değiştirildi</comment>
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text xml:space="preserve">
    ...some long Genghis Khan stuff...
    </text>
</page>

I parse it with this:

import json
import xml.etree.ElementTree as et

count = 0
titles = ''
pages = ''

for event, elem in et.iterparse('dataset/wiki_test', events=('start', 'end', 'start-ns', 'end-ns')):
    if event == 'start':
        if elem.tag == 'page':
            # Skip pages that have no children yet
            if len(list(elem)) == 0:
                continue
            title = elem.find('title').text
            if title is None or 'MediaWiki' in title:
                elem.clear()
                continue
            wiki_id = elem.find('id')
            if wiki_id is None:
                elem.clear()
                continue
            wiki_id = wiki_id.text
            revision = elem.find('revision')
            if revision is not None:
                print(list(revision))
                text = revision.find('text').text
                print(text)
                if text is not None:
                    count += 1
                    titles += title + '\n'
                    page = {'wiki_id': wiki_id, 'title': title, 'text': text}
                    pages += json.dumps(page, ensure_ascii=False) + '\n'
            elem.clear()

The line revision.find('text').text finds no text for some elements, including the one above. These make up roughly one seventh of my data. The same thing happened with page->id for some other entries, where find() reported that the element does not exist at all. I worked around that by skipping those entries, but I would rather not, and the error does not make sense to me.

Here is another page, which works totally fine.

<page>
  <title>Mustafa Suphi</title>
  <ns>0</ns>
  <id>22</id>
  <revision>
    <id>20077185</id>
    <parentid>20017115</parentid>
    <timestamp>2018-10-14T08:31:32Z</timestamp>
    <contributor>
      <username>Vikiçizer</username>
      <id>90501</id>
    </contributor>
    <comment>/* top */düzeltme [[Vikipedi:AWB|AWB]] ile</comment>
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text xml:space="preserve">
    ...some Mustafa Suphi stuff...
    </text>
    <sha1>m5finh6h2kr8h2fbtmsatp5fhz1siwq</sha1>
  </revision>
</page>

What am I doing wrong?

edited Mar 7 at 15:33

asked Mar 7 at 14:13

Burak Özmen

event == 'start' does not guarantee that the whole Element is present. Change to event == 'end'.

– stovfl
Mar 7 at 17:21

Can you please upload the complete XML and explain what data structure you want to extract from it?

– balderman
Mar 9 at 16:13
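
The first comment above can be illustrated with a small self-contained sketch. It is not the asker's code: the document is a made-up miniature, the helper name text_seen_at is hypothetical, and XMLPullParser is used instead of iterparse only so that the chunk boundary can be placed deliberately. With iterparse the boundaries come from its internal read buffering, which is a likely reason why only some pages are affected.

```python
import xml.etree.ElementTree as ET

# Made-up miniature document; the real dump is far larger.
DOC = (
    "<mediawiki>"
    "<page><title>Cengiz Han</title>"
    "<revision><text>some long text</text></revision>"
    "</page>"
    "</mediawiki>"
)


def text_seen_at(event_name, chunks):
    """Return the <text> content visible when event_name fires for <page>."""
    parser = ET.XMLPullParser(events=("start", "end"))
    for chunk in chunks:
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if event == event_name and elem.tag == "page":
                # findtext returns None when the path does not exist (yet).
                return elem.findtext("revision/text")
    return None


# Split the input in the middle of the page, as a buffer boundary might.
split = DOC.index("<revision>")
chunks = [DOC[:split], DOC[split:]]

at_start = text_seen_at("start", chunks)  # <revision> has not been parsed yet
at_end = text_seen_at("end", chunks)      # the whole <page> has been parsed
print("start:", at_start)
print("end:  ", at_end)
```

In this sketch the 'start' event fires before <revision> has been fed to the parser, so the lookup comes back empty, while the 'end' event sees the complete page. Calling elem.clear() is also only safe after the 'end' event, once the element has been fully read.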

python xml python-3.7

1 Answer

You have posted two examples: "working" and "not working".

In the "not working" one there is no closing </revision> tag.

Are you sure this is the XML you actually have, or is it just a copy-and-paste mistake?

answered Mar 9 at 7:08

balderman

Copy paste mistake.

– Burak Özmen
Mar 9 at 16:08
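
One way to tell a copy-and-paste slip from genuinely damaged data is to check well-formedness directly: a missing closing tag makes ElementTree raise ParseError rather than quietly return None. The sketch below uses a shortened, made-up version of the "not working" sample, and the helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, made-up version of the "not working" sample:
# the closing </revision> tag is missing.
BROKEN = (
    "<page><title>Cengiz Han</title>"
    "<revision><text>some text</text>"
    "</page>"
)


def is_well_formed(xml_string):
    """Return True if the string parses as XML, False on a ParseError."""
    try:
        ET.fromstring(xml_string)
    except ET.ParseError:
        return False
    return True


broken_ok = is_well_formed(BROKEN)
fixed_ok = is_well_formed(BROKEN.replace("</page>", "</revision></page>"))
print(broken_ok, fixed_ok)
```

Since the asker saw missing text rather than a parse error, this is consistent with the reply that the missing tag was only a copy-and-paste mistake.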
