Mastering XML Data Integration in PySpark: Merging, Parsing, and Analyzing Multiple Files with Ease for Data Professionals

In PySpark, you can parse and process XML data with the XML data source (spark.read.format("xml")) together with the XML- and XPath-related functions in the pyspark.sql.functions module. These let you extract information from XML documents, transform XML structures, and perform various operations on XML data. Here are some commonly used XML capabilities in PySpark:
1. spark.read.format("xml").option(key, value).load(path):
- This reader loads XML files into a DataFrame. It lets you specify options such as rowTag (the XML element to treat as a row) and excludeAttribute (whether to ignore element attributes), among others.
Sample Input:
<employees>
<employee>
<name>John Doe</name>
<age>30</age>
<designation>Manager</designation>
</employee>
<employee>
<name>Jane Smith</name>
<age>25</age>
<designation>Engineer</designation>
</employee>
</employees>
Output DataFrame:
+----------+---+-----------+
|name      |age|designation|
+----------+---+-----------+
|John Doe  |30 |Manager    |
|Jane Smith|25 |Engineer   |
+----------+---+-----------+
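A minimal, self-contained read sketch for the input above (the employees.xml path and the spark-xml package version are assumptions; on Spark 4.0+ the XML data source is built in and the packages config can be dropped):
from pyspark.sql import SparkSession
# On Spark < 4.0, format("xml") comes from the spark-xml package; the version below is illustrative.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.17.0")
    .getOrCreate()
)
# Treat each <employee> element as one DataFrame row.
df = (
    spark.read.format("xml")
    .option("rowTag", "employee")
    .load("employees.xml")  # assumed path
)
df.show(truncate=False)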
2. from_xml(col, schema):
- This function parses an XML string column and returns a struct column representing the XML structure. You must provide a schema (for example a DDL string or a StructType) describing the expected structure of the XML. It is exposed in pyspark.sql.functions in recent Spark releases; on older versions an equivalent function is provided by the spark-xml package.
Sample Input:
from pyspark.sql.functions import from_xml
xml_string = """
<employee>
<name>John Doe</name>
<age>30</age>
<designation>Manager</designation>
</employee>
"""
df = spark.createDataFrame([(xml_string,)], ['xml'])
df.show(truncate=False)
Output DataFrame:
+--------------------------------------------------------+
|xml |
+--------------------------------------------------------+
|<employee><name>John Doe</name><age>30</age><designation>Manager</designation></employee>|
+--------------------------------------------------------+
Applying the from_xml function (a DDL schema string is supplied, since the schema is required):
df.withColumn('parsed_xml', from_xml('xml', 'name STRING, age INT, designation STRING')).show(truncate=False)
Output DataFrame with Parsed XML:
+--------------------------------------------------------+--------------------------------------------------------------+
|xml |parsed_xml |
+--------------------------------------------------------+--------------------------------------------------------------+
|<employee><name>John Doe</name><age>30</age><designation>Manager</designation></employee>|{John Doe, 30, Manager}                                       |
+--------------------------------------------------------+--------------------------------------------------------------+
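For more control you can define the schema as a StructType and promote the parsed fields to top-level columns; a sketch under the same assumptions as above:
from pyspark.sql.functions import from_xml, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
employee_schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
    StructField("designation", StringType()),
])
parsed = df.withColumn("parsed_xml", from_xml("xml", employee_schema))
# Pull the individual struct fields out of the parsed column.
parsed.select(
    col("parsed_xml.name").alias("name"),
    col("parsed_xml.age").alias("age"),
    col("parsed_xml.designation").alias("designation"),
).show(truncate=False)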
3. selectExpr("xpath(col, xpath_expression) as alias", ...):
- This approach selects data from an XML string column using XPath expressions via the xpath SQL function, which returns an array of all matching string values.
Sample Input:
from pyspark.sql.functions import expr
xml_string = """
<employees>
<employee>
<name>John Doe</name>
<age>30</age>
<designation>Manager</designation>
</employee>
<employee>
<name>Jane Smith</name>
<age>25</age>
<designation>Engineer</designation>
</employee>
</employees>
"""
df = spark.createDataFrame([(xml_string,)], ['xml'])
df.show(truncate=False)
Output DataFrame:
+--------------------------------------------------------------------------------------+
|xml                                                                                   |
+--------------------------------------------------------------------------------------+
|<employees><employee><name>John Doe</name><age>30</age>...                            |
+--------------------------------------------------------------------------------------+
Applying the selectExpr function with XPath expressions:
df.selectExpr(
"xpath(xml, '/employees/employee/name/text()') as name",
"xpath(xml, '/employees/employee/age/text()') as age",
"xpath(xml, '/employees/employee/designation/text()') as designation"
).show(truncate=False)
Output DataFrame with Selected Data:
+----------------------+--------+-------------------+
|name                  |age     |designation        |
+----------------------+--------+-------------------+
|[John Doe, Jane Smith]|[30, 25]|[Manager, Engineer]|
+----------------------+--------+-------------------+
Note that each xpath call returns one array of matches per input row, so the single XML document yields a single row of arrays.
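To get one row per employee instead, one option is to zip the parallel arrays into structs and explode them; a sketch (field names inside the zipped struct follow the column names in recent Spark versions):
from pyspark.sql.functions import arrays_zip, explode, col
per_employee = (
    df.selectExpr(
        "xpath(xml, '/employees/employee/name/text()') as name",
        "xpath(xml, '/employees/employee/age/text()') as age",
        "xpath(xml, '/employees/employee/designation/text()') as designation",
    )
    # Zip the parallel arrays into an array of structs, then explode into rows.
    .withColumn("employee", explode(arrays_zip("name", "age", "designation")))
    .select(
        col("employee.name").alias("name"),
        col("employee.age").cast("int").alias("age"),
        col("employee.designation").alias("designation"),
    )
)
per_employee.show(truncate=False)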
4. explode(col):
- This function turns each element of an array column into its own row. It is useful when dealing with XML data that contains repeating elements.
Sample Input:
from pyspark.sql.functions import explode, expr
xml_string = """
<employees>
<employee>
<name>John Doe</name>
<age>30</age>
<designation>Manager</designation>
<languages>
<language>Python</language>
<language>Java</language>
</languages>
</employee>
<employee>
<name>Jane Smith</name>
<age>25</age>
<designation>Engineer</designation>
<languages>
<language>Java</language>
<language>C++</language>
</languages>
</employee>
</employees>
"""
df = spark.createDataFrame([(xml_string,)], ['xml'])
df.show(truncate=False)
Output DataFrame:
+-----------------------------------------------------------------------------------------+
|xml |
+-----------------------------------------------------------------------------------------+
|<employees><employee><name>John Doe</name><age>30</age><designation>Manager</designation>...|
+-----------------------------------------------------------------------------------------+
Applying the explode function to the language values extracted with xpath:
df.withColumn("languages", explode(expr("xpath(xml, '/employees/employee/languages/language/text()')"))).show(truncate=False)
Output DataFrame with Exploded Languages:
+-----------------------------------------------------------------------------------------+---------+
|xml                                                                                        |languages|
+-----------------------------------------------------------------------------------------+---------+
|<employees><employee><name>John Doe</name><age>30</age><designation>Manager</designation>...|Python   |
|<employees><employee><name>John Doe</name><age>30</age><designation>Manager</designation>...|Java     |
|<employees><employee><name>John Doe</name><age>30</age><designation>Manager</designation>...|Java     |
|<employees><employee><name>John Doe</name><age>30</age><designation>Manager</designation>...|C++      |
+-----------------------------------------------------------------------------------------+---------+
Note that the xml column repeats the same source document on every exploded row; Java appears twice because both employees list it.
5. to_xml(col, options=None):
- This function serializes a struct column into an XML string based on the provided options. It is the inverse of the from_xml function and, like from_xml, is exposed in pyspark.sql.functions in recent Spark releases.
Sample Input:
from pyspark.sql.functions import to_xml, struct
df = spark.createDataFrame([("John Doe", 30, "Manager"), ("Jane Smith", 25, "Engineer")], ["name", "age", "designation"])
df.show()
Output DataFrame:
+-----------+---+------------+
| name|age| designation|
+-----------+---+------------+
| John Doe| 30| Manager|
|Jane Smith | 25| Engineer|
+-----------+---+------------+
Applying the to_xml function:
df.select(to_xml(struct(df.columns), options={"rootTag": "employees", "rowTag": "employee"}).alias("xml")).show(truncate=False)
Output DataFrame with XML Serialization:
+--------------------------------------------------------+
|xml |
+--------------------------------------------------------+
|<employees><employee><name>John Doe</name><age>30</age><designation>Manager</designation></employee></employees>|
|<employees><employee><name>Jane Smith</name><age>25</age><designation>Engineer</designation></employee></employees>|
+--------------------------------------------------------+
6. xpath_string(col, xpath):
- This function evaluates an XPath expression against an XML string column and returns the first matching result as a string. The xpath_* functions are exposed in pyspark.sql.functions in recent Spark releases; on older versions you can call them through expr() or selectExpr().
Sample Input:
from pyspark.sql.functions import xpath_string
xml_string = """
<employee>
<name>John Doe</name>
<age>30</age>
<designation>Manager</designation>
</employee>
"""
df = spark.createDataFrame([(xml_string,)], ['xml'])
df.show(truncate=False)
Output DataFrame:
+--------------------------------------------------------+
|xml |
+--------------------------------------------------------+
|<employee><name>John Doe</name><age>30</age><designation>Manager</designation></employee>|
+--------------------------------------------------------+
Applying the xpath_string function:
df.select(xpath_string('xml', '/employee/name/text()').alias('name')).show(truncate=False)
Output DataFrame with XPath Evaluation:
+--------+
|name |
+--------+
|John Doe|
+--------+
7. xpath_boolean(col, xpath):
- This function evaluates an XPath expression against an XML string column and returns the result as a boolean value.
Sample Input:
from pyspark.sql.functions import xpath_boolean
xml_string = """
<employee>
<name>John Doe</name>
<age>30</age>
<designation>Manager</designation>
</employee>
"""
df = spark.createDataFrame([(xml_string,)], ['xml'])
df.show(truncate=False)
Output DataFrame:
+--------------------------------------------------------+
|xml |
+--------------------------------------------------------+
|<employee><name>John Doe</name><age>30</age><designation>Manager</designation></employee>|
+--------------------------------------------------------+
Applying the xpath_boolean function:
df.select(xpath_boolean('xml', '/employee/age/text() = 30').alias('is_age_30')).show(truncate=False)
Output DataFrame with XPath Evaluation:
+---------+
|is_age_30|
+---------+
|true |
+---------+
8. xpath_int(col, xpath):
- This function evaluates an XPath expression against an XML string column and returns the result as an integer value.
Sample Input:
from pyspark.sql.functions import xpath_int
xml_string = """
<employee>
<name>John Doe</name>
<age>30</age>
<designation>Manager</designation>
</employee>
"""
df = spark.createDataFrame([(xml_string,)], ['xml'])
df.show(truncate=False)
Output DataFrame:
+--------------------------------------------------------+
|xml |
+--------------------------------------------------------+
|<employee><name>John Doe</name><age>30</age><designation>Manager</designation></employee>|
+--------------------------------------------------------+
Applying the xpath_int function:
df.select(xpath_int('xml', '/employee/age/text()').alias('age')).show(truncate=False)
Output DataFrame with XPath Evaluation:
+---+
|age|
+---+
|30 |
+---+
9. xpath_float(col, xpath):
- This function evaluates an XPath expression against an XML string column and returns the result as a floating-point value.
Sample Input:
from pyspark.sql.functions import xpath_float
xml_string = """
<employee>
<name>John Doe</name>
<salary>5000.50</salary>
</employee>
"""
df = spark.createDataFrame([(xml_string,)], ['xml'])
df.show(truncate=False)
Output DataFrame:
+------------------------------------+
|xml |
+------------------------------------+
|<employee><name>John Doe</name><salary>5000.50</salary></employee>|
+------------------------------------+
Applying the xpath_float function:
df.select(xpath_float('xml', '/employee/salary/text()').alias('salary')).show(truncate=False)
Output DataFrame with XPath Evaluation:
+-------+
|salary |
+-------+
|5000.5 |
+-------+
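The typed xpath_* functions can be combined to pull several fields out of the same XML column in a single pass; a self-contained sketch (the combined employee document below is illustrative):
from pyspark.sql.functions import xpath_string, xpath_int, xpath_float
xml_string = """
<employee>
<name>John Doe</name>
<age>30</age>
<salary>5000.50</salary>
</employee>
"""
df = spark.createDataFrame([(xml_string,)], ['xml'])
# Extract several typed fields from the same XML column in one select.
df.select(
    xpath_string('xml', '/employee/name/text()').alias('name'),
    xpath_int('xml', '/employee/age/text()').alias('age'),
    xpath_float('xml', '/employee/salary/text()').alias('salary'),
).show(truncate=False)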
Nested XML
Nested XML refers to XML structures where elements can have child elements, creating a hierarchical or nested structure. Here’s an example of nested XML and how you can work with it in PySpark:
Sample Nested XML:
<company>
<department>
<name>HR</name>
<employees>
<employee>
<id>1001</id>
<name>John Doe</name>
<position>Manager</position>
</employee>
<employee>
<id>1002</id>
<name>Jane Smith</name>
<position>Engineer</position>
</employee>
</employees>
</department>
<department>
<name>IT</name>
<employees>
<employee>
<id>2001</id>
<name>Robert Johnson</name>
<position>System Analyst</position>
</employee>
<employee>
<id>2002</id>
<name>Sarah Thompson</name>
<position>Developer</position>
</employee>
</employees>
</department>
</company>
To work with nested XML in PySpark, you can utilize the XML functions and DataFrame operations to extract and process the nested elements.
Reading Nested XML into DataFrame:
df = spark.read.format("xml").option("rowTag", "department").load("nested.xml")
df.printSchema()
df.show(truncate=False)
Output DataFrame Schema:
root
|-- name: string (nullable = true)
|-- employees: struct (nullable = true)
| |-- employee: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- id: long (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- position: string (nullable = true)
Output DataFrame Contents:
+----+------------------------------------------------------+
|name|employees |
+----+------------------------------------------------------+
|HR |[[[1001, John Doe, Manager], [1002, Jane Smith, Engineer]]]|
|IT |[[[2001, Robert Johnson, System Analyst], [2002, Sarah Thompson, Developer]]]|
+----+------------------------------------------------------+
Flattening the Nested DataFrame: To work with the nested structure, you can flatten the DataFrame using the explode function and select the required columns.
from pyspark.sql.functions import explode
# Flatten the 'employees' array column
df_flat = df.withColumn("employee", explode("employees.employee"))
# Select required columns
df_flat.select("name", "employee.id", "employee.name", "employee.position").show(truncate=False)
Output Flattened DataFrame:
+----+----+----------------+----------------+
|name|id |name |position |
+----+----+----------------+----------------+
|HR |1001|John Doe |Manager |
|HR |1002|Jane Smith |Engineer |
|IT |2001|Robert Johnson |System Analyst |
|IT |2002|Sarah Thompson |Developer |
+----+----+----------------+----------------+
By flattening the nested XML structure, you can access the nested elements and perform operations or transformations on them using DataFrame functions in PySpark.
1. Accessing Nested Elements: You can access nested elements by using dot notation to traverse the nested structure. For example, to access the name and position of the employees in each department:
df.select("name", "employees.employee.name", "employees.employee.position").show(truncate=False)
Output:
+----+------------------+-------------------+
|name|employees.employee.name|employees.employee.position|
+----+------------------+-------------------+
|HR |[John Doe, Jane Smith] |[Manager, Engineer] |
|IT |[Robert Johnson, Sarah Thompson]|[System Analyst, Developer]|
+----+------------------+-------------------+
2. Filtering Based on Nested Elements: You can filter rows based on nested elements using the filter function. For example, to filter for employees in the IT department:
df.filter("name = 'IT'").select("employees.employee.name", "employees.employee.position").show(truncate=False)
Output:
+------------------+-------------------+
|name |position |
+------------------+-------------------+
|[Robert Johnson, Sarah Thompson]|[System Analyst, Developer]|
+------------------+-------------------+
3. Aggregating Nested Elements: You can aggregate nested elements using functions like collect_list or collect_set. For example, to collect a list of employee names in each department:
from pyspark.sql.functions import collect_list, flatten
df.groupBy("name").agg(flatten(collect_list("employees.employee.name")).alias("employee_names")).show(truncate=False)
(flatten is applied because employees.employee.name is already an array per department, so collect_list alone would produce nested arrays.)
Output:
+----+------------------------+
|name|employee_names |
+----+------------------------+
|HR |[John Doe, Jane Smith] |
|IT |[Robert Johnson, Sarah Thompson]|
+----+------------------------+
4. Writing Nested XML: You can write nested XML data back out with the XML data source writer, specifying the root and row tags. For example, to write the nested DataFrame to XML:
df.write.format("xml").option("rootTag", "company").option("rowTag", "department").save("output.xml")
This saves the DataFrame as nested XML at the specified path (Spark writes a directory of part files rather than a single file).
These operations demonstrate how you can access, filter, aggregate, and write nested XML data in PySpark. You can use DataFrame functions and SQL expressions to manipulate the nested XML structure according to your requirements.
Working with Multiple XML Files in PySpark
When working with multiple XML files in PySpark, you can combine them and parse their contents using the XML reader (spark.read.format("xml")). The approach varies depending on whether the XML files have the same structure or different structures with common keys. Let's explore both scenarios:
1. XML Files with the Same Structure: If the XML files share the same structure, you can pass a list of file paths (or a glob pattern) to load(). PySpark automatically combines the data from all the files into a single DataFrame.
Sample Code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Read XML files with the same structure
df = spark.read.format("xml").option("rowTag", "employee").load(["file1.xml", "file2.xml", "file3.xml"])
# Display the contents
df.show()
This code reads the XML files file1.xml, file2.xml, and file3.xml, which share the same structure defined by the employee row tag. The data from all the files is combined into a single DataFrame df, allowing you to perform further transformations and analysis.
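Equivalently, when the files live under a common directory you can read them all with a glob pattern; a sketch (the directory path is illustrative):
# Reading every XML file in a directory with a glob pattern works the same way.
df = (
    spark.read.format("xml")
    .option("rowTag", "employee")
    .load("data/employees/*.xml")  # assumed directory layout
)
df.show()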
2. XML Files with Different Structures but Common Keys: If the XML files have different structures but contain common keys, you can use a wildcard path to read all the files with a similar pattern. Then, you can extract the common keys and combine the data using DataFrame operations.
Sample Code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Read XML files with different structures but common keys
df1 = spark.read.format("xml").option("rowTag", "employee").load("employees1*.xml")
df2 = spark.read.format("xml").option("rowTag", "employee").load("employees2*.xml")
# Extract common keys and combine data
common_keys = ["id", "name", "position"]
combined_df = df1.select(common_keys).union(df2.select(common_keys))
# Display the contents
combined_df.show()
In this example, XML files matching the patterns employees1*.xml and employees2*.xml are read using wildcard paths. Although the structures may differ, they share the common keys id, name, and position. By selecting the common keys from both DataFrames and applying the union operation, the data is combined into a single DataFrame combined_df.
You can then perform further operations on combined_df, such as filtering, aggregating, or writing the merged XML data as per your requirements; a brief sketch follows.
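A brief sketch of such follow-up operations on combined_df (the filter value and output path are illustrative):
from pyspark.sql.functions import count
# Filter the merged data and compute a simple aggregate.
managers = combined_df.filter(combined_df.position == "Manager")
managers.show()
per_position = combined_df.groupBy("position").agg(count("*").alias("headcount"))
per_position.show()
# Write the merged data back out as XML (written as a directory of part files).
(
    combined_df.write.format("xml")
    .option("rootTag", "employees")
    .option("rowTag", "employee")
    .mode("overwrite")
    .save("merged_employees_xml")  # assumed output path
)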
These approaches let you combine and parse multiple XML files in PySpark, whether they have the same structure or different structures with common keys. By leveraging PySpark's DataFrame operations, you can manipulate and analyze the combined XML data efficiently.