SharePoint Syntex document understanding models have been updated so that you can add refinement rules to entity extractors. Entity extractors are rules created in a model to extract a specific piece of information from a document, e.g. a client name or contract date. The new refinement rule functionality lets you create a rule that removes duplicate entities, or keeps only a certain number of values or lines from the value an entity extractor returns. The rule runs at the same time the entity extractor is invoked on the document, which gives greater control over the returned values. A new Refine extracted info button is now available in the entity extractors section of a document understanding model.

In this blog I will test out all of the new refinement rules and present a use case for each rule, including an example. BONUS: I have added my sample model to the PnP Syntex Samples repository so you can download a model that uses all of the new refinement rule functionality.

Refine extracted info button now available in the entity extractors section of a document understanding model

The full list of refinement rules currently available is below:

  • Keep one or more of the first values
  • Keep one or more of the last values
  • Remove duplicate values
  • Keep one or more of the first lines
  • Keep one or more of the last lines

Testing Out All Refinement Rules

There wasn’t much documentation with examples showing all of these rules in action, how they work in documents and their intended use cases. So I did a bit of trial and error, creating some demo files in various formats to test all the rules and learn about them.

This was the document format (Report) that worked best for me in my document understanding model. My objective with this document was to test the line value functionality using Section 1 Summary (highlighted in green/blue) and select first/last lines. Then use the Section Author values (highlighted in yellow) which occur in multiple sections to test the remove duplicate functionality and select first/last values. I will create extractors with refinement rules to test/demonstrate each of the five available refinement rule functions.

Below are my findings for each of the rules. If you want to download a working document understanding model with all of these five refinement rules in action including demo files then I have added this model to the PnP Syntex Samples GitHub repository (awaiting pull request to be merged). You can download the model from the repository and install in your tenant today to see how it works.

Keep one or more of the first values

Entity extractor created named Section Authors (First Named) which has one explanation rule – “Before label” = “Section Author:”.

This rule is useful where you may have multiple values extracted by your entity extractor. In my Report document, Section Author appears at the end of every section, so it appears multiple times. If I want to keep one or more of the first Section Author values extracted, I can implement this refinement rule.

Below is a table of how I expect the rule to work with the values extracted by the extractor and then after the refinement rule has been run.

Type | Values Extracted by Extractor | Refinement Result
Keep one or more of the first values | Andy King, Shinji Okazaki, Shinji Okazaki | Andy King
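
To make the expected behaviour concrete, here is a minimal Python sketch of what this rule appears to do. It is only an illustration based on my test results, not how Syntex implements it; the function name keep_first_values and its count parameter are my own hypothetical names.

```python
def keep_first_values(values, count=1):
    """Return the first `count` extracted values, in document order.

    Hypothetical helper for illustration only; not a Syntex API.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    return values[:count]


# Values found by the Section Authors extractor in the Report document.
extracted = ["Andy King", "Shinji Okazaki", "Shinji Okazaki"]
print(keep_first_values(extracted))  # ['Andy King']
```

Setting count to 2 would keep Andy King and the first Shinji Okazaki, which is the "one or more" part of the rule name.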

Below is the result of the refinement rule, which I configured to select the first value. You can see the prediction has found three Section Author values in my report and, after the refinement rule has been invoked, only the first value, Andy King, has been kept.

Keep one or more of the last values

Entity extractor created named Section Authors (Last Named) which has one explanation rule – “Before label” = “Section Author:”.

This rule is the same as Keep one or more of the first values, except that it works in reverse, from the last value backwards. This refinement rule is again useful where you may have multiple values extracted by an entity extractor. In my document, Section Author appears at the end of every section, so it appears multiple times. If I want to keep one or more of the last Section Author values extracted, I can implement this rule.

Below is a table of how I expect the rule to work with the values extracted by the extractor and then after the refinement rule has been run.

Type | Values Extracted by Extractor | Refinement Result
Keep one or more of the last values | Andy King, Shinji Okazaki, Shinji Okazaki | Shinji Okazaki
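
As a rough Python sketch of the same idea, assuming the rule simply takes values from the end of the extracted list (keep_last_values and count are hypothetical names of mine, not part of Syntex):

```python
def keep_last_values(values, count=1):
    """Return the last `count` extracted values, still in document order.

    Hypothetical helper for illustration only; not a Syntex API.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    return values[-count:]


# Values found by the Section Authors extractor in the Report document.
extracted = ["Andy King", "Shinji Okazaki", "Shinji Okazaki"]
print(keep_last_values(extracted))  # ['Shinji Okazaki']
```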

Below is the result of the refinement rule, which I configured to select the last value. You can see the prediction has found three values and, after the refinement rule has been invoked, only the last value, Shinji Okazaki, has been kept.

Remove duplicate values

Entity extractor created named Section Authors (No Duplicates) which has one explanation rule – “Before label” = “Section Author:”.

This rule is useful where you may have multiple values extracted by your entity extractor and you wish to remove any duplicate values. In my document Section Author appears at the end of every section so appears multiple times and some authors have written multiple sections. I wish to remove all the duplicate section authors so they only appear once in the list.

Below is a table of how I expect the rule to work with the values extracted by the extractor and then after the refinement rule has been run to remove duplicate values.

Type | Values Extracted by Extractor | Refinement Result
Remove duplicate values | Andy King, Shinji Okazaki, Shinji Okazaki | Andy King, Shinji Okazaki
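
The result in the table is what you would get from an order-preserving de-duplication. Below is a small Python sketch of that behaviour. I am assuming values must match exactly to count as duplicates; I have not tested how Syntex treats differences in case or spacing, and remove_duplicate_values is a hypothetical name.

```python
def remove_duplicate_values(values):
    """Drop repeated values, keeping the first occurrence of each.

    Assumption: duplicates are exact string matches.
    Hypothetical helper for illustration only; not a Syntex API.
    """
    # dict keys keep insertion order, so the first occurrence wins.
    return list(dict.fromkeys(values))


# Values found by the Section Authors extractor in the Report document.
extracted = ["Andy King", "Shinji Okazaki", "Shinji Okazaki"]
print(remove_duplicate_values(extracted))  # ['Andy King', 'Shinji Okazaki']
```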

Below is the result of the refinement rule, which I configured to remove duplicate values. You can see the prediction has found three values and, after the refinement rule has been invoked, the duplicate Shinji Okazaki value has been removed.

Keep one or more of the first lines

Entity extractor created named Section 1 Summary (First Line) which has two explanation rules – “Before label” = “Section 1 Summary:” & “After label” = “Section Word Count:”.

This rule is useful when using Syntex to extract multiple lines of text, e.g. a section of text split over multiple lines with a line break between each line.

In Section 1 Summary of my document I have created five lines, and the text on each line reflects which line number it is, i.e. line one, line two etc. I will now use this refinement rule to select just the first line of the section.

Below is a table of how I expect the rule to work with the value extracted by the extractor and then after the refinement rule has been run.

Type: Keep one or more of the first lines

Values Extracted by Extractor:
This is line one, this is line one, this is line one, this is line one and line break.
This is line two, this is line two, this is line two this is line two, this is line two, and line break.
This is line three, this is line three, this is line three, this is line three, this is line three, and line break.
This is line four, this is line four, this is line four, this is line four, this is line four, and line break.
This is line five, this is line five, this is line five, this is line five, this is line 5, this is line 5, & line break.

Refinement Result:
This is line one, this is line one, this is line one, this is line one and line break.
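
For readers who think in code, the line rule behaves like splitting the extracted text on line breaks and keeping the top of the list. This Python sketch assumes exactly that; keep_first_lines is a hypothetical name, and the sample text is a shortened version of my Section 1 Summary.

```python
def keep_first_lines(text, count=1):
    """Return the first `count` lines of a multi-line extracted value.

    Assumption: lines are separated by line breaks only.
    Hypothetical helper for illustration only; not a Syntex API.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    lines = text.splitlines()
    return "\n".join(lines[:count])


# Shortened stand-in for the five-line Section 1 Summary.
summary = (
    "This is line one.\n"
    "This is line two.\n"
    "This is line three.\n"
    "This is line four.\n"
    "This is line five."
)
print(keep_first_lines(summary))  # This is line one.
```

Text with several sentences but no line break counts as a single line in this sketch, so it would be returned whole.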

Below is the result of the refinement rule, which I configured to select the first line. You can see the prediction has found the whole section and, after the refinement rule has been invoked, only the first line (one) has been kept.

Keep one or more of the last lines

Entity extractor created named Section 1 Summary (Last Line) which has two explanation rules – “Before label” = “Section 1 Summary:” & “After label” = “Section Word Count:”.

This rule is useful when using Syntex to extract multiple lines of text, e.g. a section of text split over multiple lines with a line break between each line. In Section 1 Summary of my document I have created five lines, and the text on each line reflects which line number it is, i.e. line one, line two etc. I will now use this refinement rule to select just the last line of the section.

Below is a table of how I expect the rule to work with the value extracted by the extractor and then after the refinement rule has been run.

Type: Keep one or more of the last lines

Values Extracted by Extractor:
This is line one, this is line one, this is line one, this is line one and line break.
This is line two, this is line two, this is line two this is line two, this is line two, and line break.
This is line three, this is line three, this is line three, this is line three, this is line three, and line break.
This is line four, this is line four, this is line four, this is line four, this is line four, and line break.
This is line five, this is line five, this is line five, this is line five, this is line 5, this is line 5, & line break.

Refinement Result:
This is line five, this is line five, this is line five, this is line five, this is line 5, this is line 5, & line break.
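
A comparable Python sketch for the last-line case, again assuming the extracted text is split on line breaks and counted from the end (keep_last_lines is a hypothetical name, and the sample text is shortened):

```python
def keep_last_lines(text, count=1):
    """Return the last `count` lines of a multi-line extracted value.

    Assumption: lines are separated by line breaks only.
    Hypothetical helper for illustration only; not a Syntex API.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    lines = text.splitlines()
    return "\n".join(lines[-count:])


# Shortened stand-in for the five-line Section 1 Summary.
summary = (
    "This is line one.\n"
    "This is line two.\n"
    "This is line three.\n"
    "This is line four.\n"
    "This is line five."
)
print(keep_last_lines(summary))  # This is line five.
```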

Below is the result of the refinement rule, which I configured to select the last line. You can see the prediction has found the whole section and, after the refinement rule has been invoked, only the last line (five) has been kept.

Summary

This took a bit of trial and error creating a few different sample documents in various formats to try and identify exactly what all of the refinement rules do and how they work. I now understand them and was then able to create a sample report document to configure a document understanding model with extractors using all of these refinement rules.

Report document understanding model using all of the refinement rules applied to a library

Refinement rules give you greater control in Syntex document understanding models, letting you further refine the information returned, and I can see this being useful in lots of scenarios. The only negative I would mention is that the select line functionality seems to work only when line breaks (i.e. pressing the Enter key on your keyboard) have been used in your sections. It would be nice if a line could be split on a full stop (period) or comma, for example. Hopefully this will come in a future update!

I hope this blog helps you figure out refinement rules and provides some visual examples of what the rules do. As mentioned previously, I will be submitting my model using all of the refinement rules, with all sample documents, to the PnP Syntex Samples GitHub repository. So I encourage you to visit the repository, download the Report model and deploy it to your tenant to see it in action.

Please let me know if you have any questions or feedback regarding this blog, or any other Syntex questions. Why not check out some of my other Syntex blogs or connect with me on Twitter for other Syntex news?
