Parsing the nested XML fields in Spark
- anydataflow
- Jun 16, 2021
- 1 min read
Updated: Jul 20, 2021
processing XML data is challenging which can be solved in spark using udf and xpath

Here we are putting some practical code which can be reused in problem statement to solve xml data challenges.
UDF to Convert XML to JSON
Here showing java udf which takes xml string as input and generate json string in output.
public static void toJsonXml(SparkSession spark) {
UDF1 getJsonXml = new UDF1<String, String>() {
public String call(final String str) throws Exception {
return XML.toJSONObject(str).toString();
}
};
spark.udf().register("toJsonXml",getJsonXml,DataTypes.StringType);
}
2. Using XPATH function:
Here let's assume spark dataset has column "test" which contains xml string and there are some child node which is matching part, adding sample string:
"<excel>
<rowgroups>
<rowgroup>
<summary>Summary1</summary>
<datarows>
<datarow>
<fieldX>fieldxval</fieldX>
<description>
<line>Desc11/</line>
<line>Desc12/</line>
</description>
</datarow>
</datarows>
</rowgroup>
</rowgroups>
</excel>"
df1.selectExpr("xpath(test, 'excel/rowgroups/rowgroup/summary/text()') summary",
"xpath(test, 'excel/rowgroups/rowgroup/datarows/datarow/fieldX/text()') fieldX",
"xpath(test, 'excel/rowgroups/rowgroup/datarows/datarow/description/line/text()') description").show(false);
+--------------------+-----------------------+------------------------------------+
|summary |fieldX |description |
+--------------------+-----------------------+------------------------------------+
|[Summary1, Summary2]|[fieldxval, fieldxval2]|[Desc11/, Desc12/, Desc21/, Desc22/]|
+--------------------+-----------------------+------------------------------------+
by playing with xpath function we can do lot's of thing, for details one can visit spark sql function and check the usage of functions.
Comments