You can extract text or barcodes from a scanned document using optical character
recognition (OCR) and use them as automatic property values for files imported from
an external source, a scanner in this case. The OCR value source is a zone defined
on a scanned page. For more information on defining different properties for objects
imported from external file sources, see Defining Metadata for an External File Source.
You can use optical character recognition with these file formats:
TIF
TIFF
JPG
JPEG
BMP
PNG
PDF
TIFF files that use an alpha channel or JPEG compression are not supported.
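As a rough pre-check before files reach the external source, the format list above could be expressed as a small filter. This is only a sketch: the name is_ocr_candidate is hypothetical and not part of M-Files, and the check looks at the file extension alone, so it cannot detect TIFF files that use an alpha channel or JPEG compression.

```python
from pathlib import Path

# Extensions taken from the list of supported formats above.
OCR_EXTENSIONS = {".tif", ".tiff", ".jpg", ".jpeg", ".bmp", ".png", ".pdf"}


def is_ocr_candidate(filename: str) -> bool:
    """Return True if the file extension is one of the listed OCR formats.

    Hypothetical helper for illustration; it does not inspect file contents.
    """
    return Path(filename).suffix.lower() in OCR_EXTENSIONS
```

A script that feeds a scan folder could use such a check to set aside files that OCR would not process.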
The use of an OCR value source is only possible when using an external source. The
OCR value source cannot be defined in M-Files Desktop.
Note: You can use the OCR value source without enabling the Use OCR to enable
full-text search of scanned documents option in the
Searchable PDF tab.
To define an OCR value source, complete the following steps:
Open M-Files Admin.
In the left-side tree view, expand a connection to the M-Files server.
Expand Document Vaults.
Expand a vault.
Expand Connections to External Sources.
Click File Sources.
On the File Sources list, double-click the file source
that you want to edit.
Result: The Connection Properties dialog is opened.
Click the Metadata tab.
Result: The Metadata tab is opened.
Click Add... to define a new property and value to be
added automatically for objects created from external files, or select one of
the existing properties and click Edit... to edit the
existing property.
Result: The Define Property dialog is opened.
Select the option Use an OCR value source and click the
Define... button.
Result: The OCR Value Source Definition dialog is opened.
In the Zone type section, select either:
Text: Select this option if the OCR zone contains
text.
or
Barcode: Select this option if the OCR zone
contains a barcode.
In the Zone position section, define a zone from which to
extract a value for the selected property. The characters may include any
letters, numbers or punctuation marks. For example, an invoice number shown on a
page can be added as the Invoice number property value for the scanned
document.
Example: An example of a zone definition:
If you are capturing a barcode and there is only one barcode to recognize on
the page, you can specify the whole page as the zone. If there are several
barcodes, restrict the zone in such a way that it contains the desired barcode
only. With QR codes, you should specify a zone larger than the actual barcode.
If the specified zone has several barcodes, all of them are considered to be a
property value.
In the Page field, enter the page number of the
scanned document that you want to use as the OCR value source.
Using the Unit options, select the appropriate
unit for defining the zone position.
In the Left field, enter the position of the left edge
of the OCR zone. The left edge of the scanned document is considered
"0".
In the Right field, enter the position of the right edge
of the OCR zone.
In the Top field, enter the position of the top edge of
the OCR zone. The top edge of the scanned document is considered
"0".
In the Bottom field, enter the position of the bottom edge
of the OCR zone.
With the Primary language and Secondary
language drop-down menus, select the primary and secondary
language of the scanned documents to improve the
quality of the recognition results. The list of secondary languages only
contains languages that are allowed to be used with the selected primary
language.
Although the OCR automatically recognizes all Western languages and Cyrillic
character sets, specifying a language selection often improves the quality of
the text recognition results. In ambiguous cases, a problematic recognition
result may be resolved by a language-specific factor, such as recognition of the
letter 'Ä' in Finnish.
Click OK to close the OCR Value Source
Definition dialog.
Back in the Define Property dialog, select either:
Use the value read as the ID of the item: Select
this option if you want to use the captured value as an identifier of the
value list item with a separately defined name.
or
Use the value read as the name of the item: Select
this option if you want to use the captured value as the name of the value
list item. You can check the Add a new item to the list if a
matching item is not found check box if you want to add a new
value list item whenever a new value is captured.
Click OK to close the Define
Property dialog.
The zone you have just defined is used to automatically
extract a value for the selected property using OCR whenever a new object is created via
the selected external file source.
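The difference between the two options in the Define Property dialog can be pictured with a small lookup. This sketch uses a hypothetical in-memory value list (a dictionary from item ID to item name) and a made-up ID scheme for new items. It does not call any M-Files API, and the exact matching rules that M-Files applies (for example, case sensitivity or trimming of whitespace) are assumptions here.

```python
from typing import Dict, Optional


def resolve_item(items: Dict[str, str], captured: str, use_as: str,
                 add_if_missing: bool = False) -> Optional[str]:
    """Return the ID of the matching value list item, or None if there is none.

    items maps item ID -> item name (hypothetical stand-in for a value list).
    use_as is "id" or "name", mirroring the two options in the dialog.
    """
    value = captured.strip()  # assumption: surrounding whitespace is ignored
    if use_as == "id":
        # The captured value identifies an item whose name is defined separately.
        return value if value in items else None
    if use_as == "name":
        # The captured value is compared against item names.
        for item_id, name in items.items():
            if name == value:
                return item_id
        if add_if_missing:
            new_id = f"auto-{len(items) + 1}"  # made-up ID scheme for this sketch
            items[new_id] = value
            return new_id
        return None
    raise ValueError(f"unknown mode: {use_as!r}")
```

With a list such as {"1001": "Acme Ltd"}, a captured "1001" resolves in ID mode, while a captured "Acme Ltd" resolves in name mode. A name that is not yet on the list is added only when the add-if-missing behavior is turned on.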
To make sure that the specified zone is correctly positioned,
in most cases the document to be scanned must be placed onto the scanner glass by
hand.
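To see how the Page, Unit, Left, Right, Top and Bottom values describe a rectangle, the zone can be modeled as a small data structure. The sketch below is an illustration, not M-Files code: the class name Zone, the unit names "mm" and "in", the 1-based page number and the scan resolution (dpi) are all assumptions. The only rule taken from the steps above is that positions are measured from the left and top edges of the page, which are "0".

```python
from dataclasses import dataclass
from typing import Tuple

MM_PER_INCH = 25.4


@dataclass(frozen=True)
class Zone:
    """Hypothetical model of an OCR zone; field names mirror the dialog."""
    page: int
    left: float
    right: float
    top: float
    bottom: float
    unit: str = "mm"  # assumed unit names: "mm" or "in"

    def validate(self) -> None:
        if self.unit not in ("mm", "in"):
            raise ValueError(f"unsupported unit: {self.unit!r}")
        if self.page < 1:  # assumption: pages are numbered from 1
            raise ValueError("page must be 1 or greater")
        if not 0 <= self.left < self.right:
            raise ValueError("left must be >= 0 and smaller than right")
        if not 0 <= self.top < self.bottom:
            raise ValueError("top must be >= 0 and smaller than bottom")

    def to_pixels(self, dpi: int) -> Tuple[int, int, int, int]:
        """Return (left, top, right, bottom) in pixels for a given scan resolution."""
        self.validate()
        divisor = MM_PER_INCH if self.unit == "mm" else 1.0
        return tuple(
            round(v * dpi / divisor)
            for v in (self.left, self.top, self.right, self.bottom)
        )
```

For a 300 dpi scan, a zone from 25.4 mm to 76.2 mm horizontally and from 12.7 mm to 25.4 mm vertically corresponds to the pixel box (300, 150, 900, 300). A sheet that shifts in the feeder moves the content relative to this box, which is why careful placement on the scanner glass matters.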
In some cases, OCR can misread the text. For example, depending on the font
type or size, the number 1 can be interpreted as the letter I. To
make sure that the characters are added correctly to metadata, you can
check the property values with event handlers and VBScript. For example, you
can use VBScript to check that all added characters are numbers. For more
information, see Event Handlers.
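The kind of check described above can be sketched as follows. An actual M-Files event handler would be written in VBScript, so this Python version only illustrates the logic. The function name check_numeric_value is hypothetical, and only the I/1 confusion comes from the text above. The other look-alike characters, and the idea of correcting them rather than just rejecting the value, are assumptions that may not suit every property.

```python
# Characters that OCR may confuse with digits. Only "I" -> "1" is mentioned
# in the text above; "l" and "O" are additional assumptions for this sketch.
OCR_LOOKALIKES = str.maketrans({"I": "1", "l": "1", "O": "0"})


def check_numeric_value(captured: str) -> str:
    """Return the captured value as digits only, or raise ValueError.

    Look-alike letters are replaced first; anything else that is not an
    ASCII digit causes the value to be rejected.
    """
    candidate = captured.strip().translate(OCR_LOOKALIKES)
    if not (candidate.isascii() and candidate.isdigit()):
        raise ValueError(f"value is not numeric: {captured!r}")
    return candidate
```

In a real event handler, the equivalent of the ValueError branch would stop the operation so that the incorrect value can be corrected by hand.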
Supported Barcode Types
The M-Files OCR module supports the following barcode types: