Blogs From The Inside Series.
Reproduced with kind permission from the blog of Ashvini Sharma (MSFT)
Thanks to Wenyang for writing all this up! Just posting it on her behalf.
Summary
InfoPath forms can be saved to XML, these XML Files can later be used in SSIS XMLSource adapter to pull out the data in tables and columns. However, there are some common problems you may meet in these scenarios. This article describes how to work around these potential problems. The issues mentioned in this article is not only specific to InfoPath files, it can also be referenced in other similar situations as well.
Stripping out multiple spaces
XMLSource does not support multiple namespaces. For example, for a XML saved from InfoPath, it will typically look like
<?xml version="1.0" encoding="UTF-8" ?>
<?mso-infoPathSolution PIVersion="1.0.0.0" solutionVersion="1.0.0.1" name="urn:schemas-microsoft-com:office:infopath:oob:ExpenseReportDomestic:1033" productVersion="11.0.6357" ?>
<?mso-application progid="InfoPath.Document"?>
<exp:expenseReport xmlns:exp="http://schemas.microsoft.com/office/infopath/2003/sample/ExpenseReport" xmlns:my="http://schemas.microsoft.com/office/infopath/2003/myXSD" xmlns:xhtml="http://www.w3.org/1999/xhtml" xml:lang="en-us">
<exp:employee>
<exp:name>
<exp:prefix />
<exp:givenName />
<exp:middleName />
<exp:surname />
<exp:suffix />
Note the multiple namespaces specified. When we use this file for XMLSource adapter, and ask the XMLSource to infer a schema for the xml file by pressing "Generate XSD…" button in the UI, an error will pop up saying
"Unable to infer the XSD from the XML file. The XML contains multiple namespaces."
Work around:
Unless the multiple namespaces in the above xml make real differences in your scenario, you can always try to use only one single namespace to work around the problem. There are two options:
- Manually remove the unnecessary namespaces specified on the XML file.
- Use a SSIS XMLTask XSLT to strip out namespaces.
XMLTask supports an operation called XSLT, in which users can specify a XML Source, a XSL style sheet source, and then execute to apply the style sheet on the XML source to get a file in new format. We can use a style sheet to get rid of multiple namespaces from the XML file.
I have a sample xsl style sheet which can be used to strip out all namespaces(use this only in case when multiple namespaces don’t make sense to your scenario)
<?xml version="1.0" encoding="utf-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="no" />
<xsl:template match="/|comment()|processing-instruction()">
<xsl:copy>
<xsl:apply-templates />
</xsl:copy>
</xsl:template>
<xsl:template match="*">
<xsl:element name="{local-name()}">
<xsl:apply-templates select="@*|node()" />
</xsl:element>
</xsl:template>
<xsl:template match="@*">
<xsl:attribute name="{local-name()}">
<xsl:value-of select="." />
</xsl:attribute>
</xsl:template>
</xsl:stylesheet>
…
The way to use this style sheet is: In SSIS designer, add a XMLTask, choose XSLT operation, then point "Source" to your XML source file and “SecondOperand” to your XSL file("File Connection" as the Type can be used so as to conveniently pointing to files in both cases), set “SaveOperationResult” to true to obtain the XSLT operation result. Then use this result in XMLSource in the downstream dataflow task.
Note: <?mso-infoPath…> will be ignored by XMLSource, so it’s ok to leave it there.
Wrap up the fields in a single layer XML source
An extreme example of using InfoPath xml source is when the form has only one single level, no hierarchy. In other words, originally the XML file is something like
<?xml version="1.0" encoding="UTF-8"?>
<?mso-infoPathSolution solutionVersion="1.0.0.2" productVersion="11.0.6357" PIVersion="1.0.0.0" href="file:///c:\infofile.xsn" name="urn:schemas-microsoft-com:office:infopath:Info1:-myXSD-2005-04-27T19-26-55" ?>
<?mso-application progid="InfoPath.Document"?>
<my:myFields xmlns:my="http://schemas.microsoft.com/office/infopath/2003/myXSD/2005-04-27T19:26:55" >
<my:FirstName>Wenyang</my:FirstName>
<my:LastName>Hu</my:LastName>
<my:PhoneNumber>425-123-4567</my:PhoneNumber>
</my:myFields>
XMLSource will think all fields under the root – "FirstName"/"LastName"/"PhoneNumber" as tables, instead of columns which belong to a same table. That’s not going to work.
Work Around
The way to work around this is, after removing the unnecessary namespaces, add a pair of tags to wrap the fields. So change the above to
<?xml version="1.0" encoding="UTF-8"?>
<myTable>
<myFields>
<FirstName>Wenyang</FirstName>
<LastName>Hu</LastName>
<PhoneNumber>425-123-4567</PhoneNumber>
</myFields>
</myTable>
Then the columns will be retrieved from XMLSource successfully - there will be an output called "myFields" and "FirstName"/"LastName"/"PhoneNumber” will be the output columns of that output, as expected.
Deal with the fields with missing values
<?xml version="1.0" encoding="utf-8" ?>
<expenseReport lang="en-us">
<employee>
<name>
<prefix />
<givenName />
<middleName />
<surname />
<suffix />
<singleName>John</singleName>
</name>
<address>
<line1>1234 56th AVE</line1>
<line2>Unit B101</line2>
<line3 />
<line4 />
<city>Redmond</city>
<stateProvince>WA</stateProvince>
<postalCode>98034</postalCode>
<country>USA</country>
</address>
…
<identificationNumber>1336</identificationNumber>
<email>johns@hotmail.com</email>
</employee>
</expenseReport>
It's common that the InfoPath users to leave certain fields in a form blank, like the "prefix" and "givenName" fields in “name” above. Then in XMLSource, if the users ask the XMLSource to infer a XSD, the inferred schema won’t set the data types for those fields, then a bug in XMLDataReader will cause all those fields be treated as tables. That's very wrong.
Work Around
Before getting the fix for the bug, work arounds for this problem include
- Give values for all fields.
- Sometime is not acceptable. In that case, ask the XMLSource to infer a XSD file first, then manually edit that XSD file, add "type="xs:string" to that element.
For example, change
<xs:element minOccurs="0" name="prefix"/> to
<xs:element minOccurs="0" name="prefix" type="xs:string>