How to set the `xpath` of pandas's read_xml?

I want to parse the data in the Component elements of an XML file. The relevant part looks like this:

<Component>
  <UnderlyingSecurityID>300001</UnderlyingSecurityID>
  <UnderlyingSecurityIDSource>102</UnderlyingSecurityIDSource>
  <UnderlyingSymbol>特锐德</UnderlyingSymbol>
  <ComponentShare>300.00</ComponentShare>
  <SubstituteFlag>1</SubstituteFlag>
  <PremiumRatio>0.25000</PremiumRatio>
  <CreationCashSubstitute>0.0000</CreationCashSubstitute>
  <RedemptionCashSubstitute>0.0000</RedemptionCashSubstitute>
</Component>
<Component>
  <UnderlyingSecurityID>300003</UnderlyingSecurityID>
  <UnderlyingSecurityIDSource>102</UnderlyingSecurityIDSource>
  <UnderlyingSymbol>乐普医疗</UnderlyingSymbol>
  <ComponentShare>600.00</ComponentShare>
  <SubstituteFlag>1</SubstituteFlag>
  <PremiumRatio>0.25000</PremiumRatio>
  <CreationCashSubstitute>0.0000</CreationCashSubstitute>
  <RedemptionCashSubstitute>0.0000</RedemptionCashSubstitute>
</Component>

I have installed the latest versions of lxml and pandas and tried the following code, but with no luck.

Python 3.9.4 (tags/v3.9.4:1f2e308, Apr  6 2021, 13:40:21) [MSC v.1928 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.25.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '1.3.0'

In [3]: xml = pd.read_xml('https://www.huaan.com.cn/etf/159949/etffiledownload.jsp?etffilename=pcf_159949_20210707.xml', xpath='//component')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-67d228028cc9> in <module>
----> 1 xml = pd.read_xml('https://www.huaan.com.cn/etf/159949/etffiledownload.jsp?etffilename=pcf_159949_20210707.xml', xpath='//component')

...
    501         if elems == []:
--> 502             raise ValueError(msg)
    503 
    504         if elems != [] and attrs == [] and children == []:

ValueError: xpath does not return any nodes. Be sure row level nodes are in xpath. If document uses namespaces denoted with xmlns, be sure to define namespaces and use them in xpath.

In [4]: xml = pd.read_xml('https://www.huaan.com.cn/etf/159949/etffiledownload.jsp?etffilename=pcf_159949_20210707.xml', xpath='//component', namespaces={'com': 'http://ts.szse.cn/Fund'})
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-52fbe542dadb> in <module>
----> 1 xml = pd.read_xml('https://www.huaan.com.cn/etf/159949/etffiledownload.jsp?etffilename=pcf_159949_20210707.xml', xpath='//component', namespaces={'com': 'http://ts.szse.cn/Fund'})

...
    501         if elems == []:
--> 502             raise ValueError(msg)
    503 
    504         if elems != [] and attrs == [] and children == []:

ValueError: xpath does not return any nodes. Be sure row level nodes are in xpath. If document uses namespaces denoted with xmlns, be sure to define namespaces and use them in xpath.

I also tried lxml directly, and this seems to work:

In [5]: from lxml import etree
In [6]: import requests
In [7]: content = requests.get('https://www.huaan.com.cn/etf/159949/etffiledownload.jsp?etffilename=pcf_159949_20210707.xml').content

In [8]: html = etree.HTML(content)
In [9]: html.xpath('//component')
Out[9]: 
[<Element component at 0x1d493cb23c0>,
 <Element component at 0x1d493cb2340>,
 <Element component at 0x1d493cb2240>,
 <Element component at 0x1d493cb22c0>,
 <Element component at 0x1d493cb2140>,
 <Element component at 0x1d493cb2040>,
 <Element component at 0x1d493cb2c40>,
 <Element component at 0x1d493cb61c0>,
 <Element component at 0x1d493cb63c0>,
 <Element component at 0x1d493cb2200>,
 ...

I don't know why read_xml doesn't work. Any help would be appreciated!

Answers:

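In short, the solution is to identify the node you want, in this case Component (XPath is case-sensitive, so the lower-case component does not match), and set xpath to that name prefixed with //: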

pd.read_xml(your_xml_file, xpath='//Component')
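If the file declares a default namespace (the `{http://ts.szse.cn/Fund}` prefix used in the other answer suggests it does), the node name probably also needs a prefix bound through `namespaces`. The following is a minimal sketch under that assumption, not a confirmed fix for the real file. The root name `Root`, the trimmed field list, and the helper `read_components` are made up for illustration. It needs pandas 1.3 or later with lxml installed.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded file: the root name "Root" is made
# up; the namespace URI and the Component fields come from the question.
SAMPLE_XML = """\
<Root xmlns="http://ts.szse.cn/Fund">
  <Components>
    <Component>
      <UnderlyingSecurityID>300001</UnderlyingSecurityID>
      <ComponentShare>300.00</ComponentShare>
    </Component>
    <Component>
      <UnderlyingSecurityID>300003</UnderlyingSecurityID>
      <ComponentShare>600.00</ComponentShare>
    </Component>
  </Components>
</Root>
"""

# "com" is an arbitrary prefix; only the URI has to match the document.
NAMESPACES = {"com": "http://ts.szse.cn/Fund"}


def read_components(xml_text, xpath="//com:Component"):
    """Parse the row nodes selected by xpath into a DataFrame (default lxml parser)."""
    return pd.read_xml(
        io.BytesIO(xml_text.encode("utf-8")),
        xpath=xpath,
        namespaces=NAMESPACES,
    )


df = read_components(SAMPLE_XML)
print(df)

# A lower-case node name selects nothing, which pandas reports as ValueError.
try:
    read_components(SAMPLE_XML, xpath="//com:component")
except ValueError as exc:
    print("no match:", exc)
```

The `etree.HTML` call in the question most likely matched `//component` because lxml's HTML parser lower-cases tag names and ignores XML namespaces, which an XML parser does not do.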

Alternatively, you can use xml.etree.ElementTree instead of pd.read_xml():

import xml.etree.ElementTree as ET
import pandas as pd
import requests

url = 'https://www.huaan.com.cn/etf/159949/etffiledownload.jsp?etffilename=pcf_159949_20210707.xml'
response = requests.get(url)
# ET.fromstring already returns the root element, so no ElementTree wrapper is needed.
root = ET.fromstring(response.content)

namespace = "{http://ts.szse.cn/Fund}"

columns = [
    'UnderlyingSecurityID', 'UnderlyingSecurityIDSource', 'UnderlyingSymbol',
    'ComponentShare', 'SubstituteFlag', 'PremiumRatio',
    'CreationCashSubstitute', 'RedemptionCashSubstitute',
]

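data = []
for elem in root:
    # Rows live under the namespace-qualified <Components> container.
    if elem.tag == f"{namespace}Components":
        for com in elem.findall(f"{namespace}Component"):
            # Look each field up by name so a missing or reordered child cannot shift columns.
            data.append([com.findtext(f"{namespace}{col}") for col in columns])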

df = pd.DataFrame(data, columns=columns)
print(df.to_string())
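To try the same idea without network access, here is a small standard-library sketch that runs against an inline sample. The root name `Root`, the shortened field list, and the helper `component_rows` are assumptions for illustration. Only the namespace URI and the element names are taken from the posts above.

```python
import xml.etree.ElementTree as ET

# "com" is an arbitrary local alias for the namespace URI used in the answer.
NS = {"com": "http://ts.szse.cn/Fund"}

# Hypothetical sample: the root name "Root" is made up for this sketch.
SAMPLE_XML = """\
<Root xmlns="http://ts.szse.cn/Fund">
  <Components>
    <Component>
      <UnderlyingSecurityID>300001</UnderlyingSecurityID>
      <UnderlyingSymbol>特锐德</UnderlyingSymbol>
      <ComponentShare>300.00</ComponentShare>
    </Component>
    <Component>
      <UnderlyingSecurityID>300003</UnderlyingSecurityID>
      <UnderlyingSymbol>乐普医疗</UnderlyingSymbol>
      <ComponentShare>600.00</ComponentShare>
    </Component>
  </Components>
</Root>
"""


def component_rows(xml_text):
    """Return one dict per <Component>, keyed by child tag without the namespace."""
    root = ET.fromstring(xml_text)
    rows = []
    for com in root.findall(".//com:Component", NS):
        # Tags look like "{uri}Name"; keep only the part after the closing brace.
        rows.append({child.tag.split("}", 1)[-1]: child.text for child in com})
    return rows


rows = component_rows(SAMPLE_XML)
print(rows)
```

Passing `rows` to `pd.DataFrame(rows)` should give one column per child tag. Unlike `pd.read_xml`, this leaves every value as a string.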