
Parsing the nested XML fields in Spark

  • Writer: anydataflow
  • Jun 16, 2021
  • 1 min read

Updated: Jul 20, 2021

Processing XML data is challenging; in Spark it can be tackled with a UDF and the xpath SQL function.





Here is some practical code that can be reused to solve common XML data challenges.


  1. UDF to Convert XML to JSON

Here is a Java UDF that takes an XML string as input and returns a JSON string.


// Requires org.json (XML.toJSONObject) on the classpath, plus:
// import org.apache.spark.sql.api.java.UDF1;
// import org.apache.spark.sql.types.DataTypes;
// import org.json.XML;
public static void toJsonXml(SparkSession spark) {
    UDF1<String, String> getJsonXml = new UDF1<String, String>() {
        public String call(final String str) throws Exception {
            return XML.toJSONObject(str).toString();
        }
    };
    spark.udf().register("toJsonXml", getJsonXml, DataTypes.StringType);
}
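
Outside Spark, the same conversion idea can be sketched in plain Python using only the standard library. This is a rough stand-in for org.json's XML.toJSONObject (the `xml_to_dict` helper is hypothetical, not part of the post's code), shown only to illustrate what the UDF produces:

```python
import json
import xml.etree.ElementTree as ET

def xml_to_dict(elem):
    """Recursively convert an ElementTree element into a dict,
    roughly mirroring org.json's XML.toJSONObject behavior."""
    children = list(elem)
    if not children:
        return elem.text
    result = {}
    for child in children:
        value = xml_to_dict(child)
        if child.tag in result:
            # Repeated sibling tags collapse into a list
            existing = result[child.tag]
            if isinstance(existing, list):
                existing.append(value)
            else:
                result[child.tag] = [existing, value]
        else:
            result[child.tag] = value
    return result

def to_json_xml(xml_str):
    root = ET.fromstring(xml_str)
    return json.dumps({root.tag: xml_to_dict(root)})

print(to_json_xml("<a><b>1</b><b>2</b></a>"))
# repeated <b> tags become a JSON array
```

Once the JSON string is in a column, Spark's built-in JSON functions (get_json_object, from_json) can take over.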

  2. Using the xpath Function


Assume the Spark dataset has a column "test" containing an XML string with child nodes to match. A sample string:


"<excel>
  <rowgroups>
    <rowgroup>
      <summary>Summary1</summary>
      <datarows>
        <datarow>
          <fieldX>fieldxval</fieldX>
          <description>
            <line>Desc11/</line>
            <line>Desc12/</line>
          </description>
        </datarow>
      </datarows>
    </rowgroup>
  </rowgroups>
</excel>"


 
df1.selectExpr("xpath(test, 'excel/rowgroups/rowgroup/summary/text()') summary",
        "xpath(test, 'excel/rowgroups/rowgroup/datarows/datarow/fieldX/text()') fieldX",
        "xpath(test, 'excel/rowgroups/rowgroup/datarows/datarow/description/line/text()') description").show(false);




+--------------------+-----------------------+------------------------------------+
|summary             |fieldX                 |description                         |
+--------------------+-----------------------+------------------------------------+
|[Summary1, Summary2]|[fieldxval, fieldxval2]|[Desc11/, Desc12/, Desc21/, Desc22/]|
+--------------------+-----------------------+------------------------------------+
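
The same selections can be sanity-checked outside Spark with Python's standard ElementTree module. This is only a sketch to show what those xpath expressions match against the sample document above (note the paths here are relative to the `<excel>` root):

```python
import xml.etree.ElementTree as ET

# Single-rowgroup sample document from the post
sample = """<excel><rowgroups><rowgroup>
<summary>Summary1</summary>
<datarows><datarow>
<fieldX>fieldxval</fieldX>
<description><line>Desc11/</line><line>Desc12/</line></description>
</datarow></datarows>
</rowgroup></rowgroups></excel>"""

root = ET.fromstring(sample)
# Each findall path mirrors one of the xpath() calls in the selectExpr
summary = [e.text for e in root.findall('rowgroups/rowgroup/summary')]
fieldx = [e.text for e in root.findall('rowgroups/rowgroup/datarows/datarow/fieldX')]
lines = [e.text for e in root.findall('rowgroups/rowgroup/datarows/datarow/description/line')]
print(summary, fieldx, lines)
# → ['Summary1'] ['fieldxval'] ['Desc11/', 'Desc12/']
```

With two rowgroups in the column (as in the dataset behind the output table above), each list simply collects the matches from every rowgroup, which is why Spark's xpath() returns an array column.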


By combining xpath expressions we can do a lot; for details, see the Spark SQL built-in functions documentation and check each function's usage.







 
 
 
