[XeTeX] Type0 fonts somehow not built correctly for Unicode text-extraction and Accessibility

Ross Moore

2018-08-05 22:11:02 UTC

There seems to be a subtle problem with the way subsetted Type0 fonts are built
by xdvipdfmx with XeLaTeX jobs, for the purposes of finding the /ToUnicode resource.

The main view displays fine, but when checking other aspects for standards compliance, some basic tests fail.
See, for example, the included image.

[cid:4F0A2FCB-B291-48D1-9450-***@telstra.com.au]

Firstly, the CIDSet is not built correctly: it does not include all the glyphs that are used.
pdfTeX has a similar problem with regard to CharSet.
The issue seems to be that if an accented character is built internally from multiple glyphs,
then each of those glyphs should be included in the CIDSet, as well as the combined character.
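
To illustrate the point, here is a minimal Python sketch of how a CIDSet bit table is laid out (one bit per CID, high-order bit first, with the most significant bit of the first byte standing for CID 0) and how one might check it against the glyphs actually used. The function names and the glyph numbers are hypothetical, chosen only for the example; this is not how xdvipdfmx builds the stream.

```python
def build_cidset(cids):
    """Return CIDSet stream bytes with the bit for each CID in `cids` set."""
    cids = set(cids)
    if not cids:
        return b""
    bits = bytearray(max(cids) // 8 + 1)
    for cid in cids:
        # High-order bit first: CID 0 is the top bit of byte 0.
        bits[cid // 8] |= 0x80 >> (cid % 8)
    return bytes(bits)


def missing_from_cidset(cidset, used_cids):
    """Return the sorted CIDs in `used_cids` whose bit is not set in `cidset`."""
    missing = []
    for cid in used_cids:
        index, mask = cid // 8, 0x80 >> (cid % 8)
        if index >= len(cidset) or not (cidset[index] & mask):
            missing.append(cid)
    return sorted(missing)


# Hypothetical glyph numbers: 36 is a composite accented glyph that is
# built internally from component glyphs 72 and 141; 0 is .notdef.
composite_only = build_cidset({0, 36})
used = {0, 36, 72, 141}
print(missing_from_cidset(composite_only, used))  # -> [72, 141]
```

A CIDSet that records only the composite glyph, as in `composite_only`, is the kind of incomplete table described above.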

Acrobat's Preflight has a filter to remove such incomplete CIDSets, so this isn't a crucial deficiency.

Secondly, although clearly present, the /ToUnicode CMap resource is not being found.
The font seems to be named correctly here, according to:

page 279 of ISO 32000-1:2008

§ 9.7.6 Type 0 Font Dictionaries
§ 9.7.6.1 General
A Type 0 font dictionary contains the entries listed in Table 121.

Table 121 – Entries in a Type 0 font dictionary

BaseFont name (Required) The name of the font.
If the descendant is a Type 0 CIDFont, this name should be the concatenation of the CIDFont's BaseFont name, a hyphen,
and the CMap name given in the Encoding entry (or the CMapName entry in the CMap).
If the descendant is a Type 2 CIDFont, this name should be the same as the CIDFont's BaseFont name.

Since the descendant here is a Type 2 CIDFont, the final sentence applies: the Type 0 font's BaseFont should be the same as the CIDFont's BaseFont name.
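
The naming rule quoted from Table 121 can be written down as a small Python sketch. The function name and the font name "ABCDEF+Example" are hypothetical; the rule is a "should" in the standard, so this only shows what name a checker might expect, not what any tool enforces.

```python
def expected_type0_basefont(cidfont_basefont, descendant_subtype, cmap_name=None):
    """Return the BaseFont name Table 121 recommends for a Type 0 font."""
    if descendant_subtype == "CIDFontType0":
        if cmap_name is None:
            raise ValueError("a CMap name is needed for a Type 0 CIDFont")
        # CIDFont's BaseFont name, a hyphen, then the CMap name.
        return f"{cidfont_basefont}-{cmap_name}"
    if descendant_subtype == "CIDFontType2":
        # Same as the CIDFont's BaseFont name.
        return cidfont_basefont
    raise ValueError(f"unexpected CIDFont subtype: {descendant_subtype}")


# Hypothetical subset font name, for illustration only.
print(expected_type0_basefont("ABCDEF+Example", "CIDFontType2"))
print(expected_type0_basefont("ABCDEF+Example", "CIDFontType0", "Identity-H"))
```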

And since it is a subset of the full font, the last sentence below is also applicable.

page 285 of ISO 32000-1:2008

§ 9.8.3 Font Descriptors for CIDFonts
§ 9.8.3.1 General
In addition to the entries in Table 122, the FontDescriptor dictionaries of CIDFonts may contain the entries listed in Table 124.

Table 124 – Additional font descriptor entries for CIDFonts

CIDSet stream (Optional) A stream identifying which CIDs are present in the CIDFont file.
If this entry is present, the CIDFont shall contain only a subset of the glyphs in the character collection defined by the CIDSystemInfo dictionary.
If it is absent, the only indication of a CIDFont subset shall be the subset tag in the FontName entry (see 9.6.4, "Font Subsets").
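
For reference, the subset tag described in 9.6.4 is six uppercase letters followed by a plus sign, in front of the font's PostScript name. A minimal Python sketch for recognising it follows; the helper name and the sample font name are hypothetical.

```python
import re

# Six uppercase letters, a plus sign, then the rest of the font name.
SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_tag(font_name):
    """Return (tag, base_name); tag is None when no subset tag is present."""
    match = SUBSET_TAG.match(font_name)
    if match:
        return match.group(1), match.group(2)
    return None, font_name


print(split_subset_tag("ABCDEF+Example"))  # -> ('ABCDEF', 'Example')
print(split_subset_tag("Example"))         # -> (None, 'Example')
```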

So I cannot see why the /ToUnicode resource is not being found.
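
One way to rule out a malformed CMap would be to extract and decompress the /ToUnicode stream and read its mappings directly. The Python sketch below does this for bfchar sections only (bfrange is not handled), and the sample CMap fragment and its glyph codes are made up for the example rather than taken from the failing PDF.

```python
import re

BFCHAR_BLOCK = re.compile(rb"beginbfchar(.*?)endbfchar", re.DOTALL)
HEX_PAIR = re.compile(rb"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_bytes):
    """Map each source code to its Unicode string, from bfchar sections only."""
    mapping = {}
    for block in BFCHAR_BLOCK.finditer(cmap_bytes):
        for src, dst in HEX_PAIR.findall(block.group(1)):
            code = int(src.decode("ascii"), 16)
            # Destination values are UTF-16BE code units written in hex.
            mapping[code] = bytes.fromhex(dst.decode("ascii")).decode("utf-16-be")
    return mapping


# Made-up fragment: code 0x0024 maps to U+00E9, code 0x0048 maps to U+0065.
sample = b"""2 beginbfchar
<0024> <00E9>
<0048> <0065>
endbfchar"""
print(parse_bfchar(sample))
```

If the mappings read back sensibly, the CMap content itself is probably not the reason the resource is reported as missing.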

Would someone with more experience in building and subsetting fonts please have a look at this issue?

Cheers,

Ross

Dr Ross Moore
Mathematics Dept | 12 Wally's Walk, 734
Macquarie University, NSW 2109, Australia
T: +61 2 9850 8955 | F: +61 2 9850 8114
M:+61 407 288 255 | E: ***@mq.edu.au<mailto:***@mq.edu.au>

http://www.maths.mq.edu.au
