Optical Character Recognition (OCR) is now available in Microsoft Syntex – which allows extraction of text from images. Both printed and handwritten text can be extracted from images such as posters, labels, forms, invoices, site surveys etc.
All text that is recognised is extracted to a SharePoint column called “Extracted Text” (internal name: MediaServiceOCR). This means the extracted text is available digitally, is indexed by search (so it can be queried), and is available to compliance features such as data loss prevention (DLP).
All of this happens within a SharePoint document library where an image is added. JPEG, JPG, PNG and BMP are currently supported; PDF as an image is not currently supported but is on the M365 roadmap for release soon. Syntex then scans the image and extracts the text it recognises. As mentioned, the extracted text can be used for searching to find keywords and phrases contained in the images.
The possibilities are endless, as the extracted text could be used in other “workflows” (Power Automate, Logic Apps etc.) and even in other systems, for example to update a CRM or populate a database. One other great Syntex use case could be to use Content Assembly to generate a document from the extracted text. The OCR service also supports recognising text in over 150 languages!
Setup Microsoft Syntex OCR
Below I will walk you through the process of enabling Syntex OCR in your tenant and testing it by adding images containing text.
In the M365 Syntex admin centre, ensure your Azure subscription is set up and then click on Manage Microsoft Syntex.
Click on Optical character recognition (preview)
Here are the current configuration options to select the SharePoint libraries where you would like to enable optical character recognition.
There are three options: libraries in all SharePoint sites, libraries in selected sites (search for a site or upload a CSV of site URLs), or no SharePoint libraries. I’m targeting Microsoft Syntex OCR to one site, so I searched for my Syntex site and added it.
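If you have many sites to enable, the CSV of site URLs could be generated rather than typed by hand. The sketch below is only an illustration: the function name build_site_csv is hypothetical, and the layout (one URL per row, no header) is an assumption, so check the format the admin centre asks for before uploading.

```python
import csv
import io
from urllib.parse import urlparse


def build_site_csv(site_urls):
    """Return CSV text with one validated SharePoint site URL per row.

    Assumption: the upload expects a single column of URLs with no header.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    seen = set()
    for raw in site_urls:
        url = raw.strip().rstrip("/")
        parsed = urlparse(url)
        # Only accept https URLs on a sharepoint.com host.
        if parsed.scheme != "https" or not parsed.netloc.endswith(".sharepoint.com"):
            raise ValueError(f"Not a SharePoint site URL: {raw!r}")
        if url.lower() in seen:
            continue  # skip duplicates (case-insensitive, trailing slash ignored)
        seen.add(url.lower())
        writer.writerow([url])
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder URLs; replace with your own sites.
    print(build_site_csv([
        "https://tenantname.sharepoint.com/sites/sitename",
        "https://tenantname.sharepoint.com/sites/sitename/",
    ]))
```

The validation is deliberately strict so that a typo in a URL fails early rather than silently leaving a site out.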
NOTE – slightly confusingly, the wording talks about enabling Microsoft Syntex OCR on libraries but then only allows you to select sites. There is presently no ability to select particular libraries, so if a site is selected then all libraries in that site are enabled for Syntex OCR, which might not be what you wanted.
That’s literally all the configuration that is needed, or even possible, at the current time. Now, on an enabled site, add some images with text to a library and then wait for them to be processed (this took around 30 minutes for me).
Here is the end result – works really well including some challenging handwritten text!
NOTE: on sites where Microsoft Syntex OCR is enabled, the special Extracted Text column is only added to a library once images have been added to that library and processed by Syntex OCR. The column is then populated with the extracted text each time an image is processed. In the picture above I have added this column to the view of the library, along with a Thumbnail image column to provide a preview of the image.
Extending Microsoft Syntex OCR with Search
I will now show you how this Extracted Text can be used to search for words and phrases within images that have been processed using Microsoft Syntex OCR.
First map a SharePoint search managed property – I chose an unmapped RefinableString (in my case RefinableString100) and then mapped it to the crawled property OWS_MEDIASERVICEOCR. This means that for each Syntex OCR’d image the RefinableString will be populated with the extracted text and I can use this in search to search over the extracted text.
I will now show you how you can use this managed property to formulate a search query that finds all the Syntex OCR’d images in your tenant that contain a word, in my case “fish”.
Here is the Image (of text) I have uploaded to my Syntex OCR document library
Here is the text that Syntex OCR extracted for the image (a few typos or missed letters, but generally it is all there).
I use the search query RefinableString100:"*fish*", which means search over items that have RefinableString100 populated and find any matches where the word “fish” is mentioned anywhere in the text. Note that an asterisk (*) is used to denote a wildcard (any text or characters) before and after the word.
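If you want to build this query for different words, a small helper keeps the quoting consistent. This is a minimal sketch: build_ocr_query is a hypothetical name, and RefinableString100 is simply the managed property I mapped above, so substitute your own. One caveat: as far as I know the KQL documentation only describes prefix matching (a trailing asterisk), so the leading asterisk may simply be ignored; it is kept here to match the query used in this post.

```python
def build_ocr_query(term, managed_property="RefinableString100"):
    """Build a KQL property restriction such as RefinableString100:"*fish*".

    managed_property is whichever managed property you mapped to the
    OWS_MEDIASERVICEOCR crawled property (RefinableString100 in this post).
    """
    term = term.strip()
    if not term:
        raise ValueError("term must not be empty")
    if '"' in term:
        # A double quote would end the quoted phrase early and break the query.
        raise ValueError("term must not contain double quotes")
    return f'{managed_property}:"*{term}*"'


if __name__ == "__main__":
    print(build_ocr_query("fish"))
```

The output can be pasted into the SharePoint Search Query tool or the SharePoint search box.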
Below I have used the great SharePoint Search Query tool to show the search query in action & the search results returned along with the managed properties returned for the matching image. You can see below that RefinableString100 is populated with the extracted text. This means all the extracted text is available for searching!
You could also add Path:"https://tenantname.sharepoint.com/sites/sitename/libraryname/*", for example, to restrict the query to a particular library.
I can also use the search query RefinableString100:"*fish*" in the SharePoint search box and the image is found.
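The same query can also be sent to the SharePoint Search REST endpoint (/_api/search/query) from a script or workflow. The sketch below only builds the request URL and does not handle authentication or send the request; build_search_url is a hypothetical helper, the tenant, site and library names are placeholders, and joining the two restrictions with AND is my own choice for the example.

```python
from urllib.parse import quote


def build_search_url(site_url, kql,
                     select_properties=("Title", "Path", "RefinableString100")):
    """Return a SharePoint Search REST URL for the given KQL query text."""
    # REST string values are wrapped in single quotes; a literal single
    # quote inside the value is escaped by doubling it.
    escaped = kql.replace("'", "''")
    query = quote(f"'{escaped}'", safe="")
    select = quote("'" + ",".join(select_properties) + "'", safe="")
    base = site_url.rstrip("/")
    return f"{base}/_api/search/query?querytext={query}&selectproperties={select}"


if __name__ == "__main__":
    # Placeholder tenant, site and library names.
    kql = ('RefinableString100:"*fish*" AND '
           'Path:"https://tenantname.sharepoint.com/sites/sitename/libraryname/*"')
    print(build_search_url("https://tenantname.sharepoint.com/sites/sitename", kql))
```

Including RefinableString100 in selectproperties means the extracted text comes back with each result, which is handy for checking what Syntex OCR actually recognised.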
To further extend this we could use data loss prevention (DLP) in Purview to prevent sharing of documents where a keyword or phrase appears in them using the extracted text.
Syntex OCR seems to work very well at extracting text from images. It works great with handwritten text and even copes very well with hard-to-read handwriting. This is a great addition to the Syntex product suite, as it will bring static images of text to life and enable staff to search for phrases or keywords within the images. The cost is currently $0.001 per image processed by Syntex OCR.
You can see below that it is relatively accurate and the wording makes sense, although a few words have been recognised incorrectly or missed. Overall, most of the extracted text is correct, so you can definitely read and understand what has been extracted.
One slight limitation at the moment is that the only supported file types are images (JPEG, JPG, PNG & BMP); there is no support for PDFs that have been scanned as images, i.e. where the text cannot be selected.
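To get a feel for what that price means for a library, here is a quick estimate. It is a rough sketch only: it assumes the flat $0.001 per image quoted above with no tiers or minimum charges, the price may change, and estimate_ocr_cost is a hypothetical helper name.

```python
from decimal import Decimal

# Price quoted in this post; treat it as an assumption that may change.
PRICE_PER_IMAGE_USD = Decimal("0.001")


def estimate_ocr_cost(image_count, price_per_image=PRICE_PER_IMAGE_USD):
    """Estimate the Syntex OCR cost in USD for a number of processed images."""
    if image_count < 0:
        raise ValueError("image_count must not be negative")
    # Decimal avoids floating-point rounding noise on small per-item prices.
    return price_per_image * image_count


if __name__ == "__main__":
    for count in (1_000, 50_000, 1_000_000):
        print(f"{count:>9,} images: ${estimate_ocr_cost(count):.2f}")
```

So 50,000 images would come to about $50 at the quoted price.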
EDIT: This is on the M365 Roadmap (124940) to bring support soon for multipage PDF and TIFF files!
This would be a HUGE win for my customers who have lots of old PDFs that were scanned “as is”, or scanned before scanners were OCR-enabled. They would love Syntex OCR to be able to extract the text from these PDF images instead of using third-party services.
Still, Syntex OCR seems to work very well for images, and I can’t wait to see my customers use it and be able to search their images with Syntex OCR and Syntex Image Tagging.