[XeTeX] Type0 fonts somehow not built correctly for Unicode text-extraction and Accessibility
Ross Moore
2018-08-05 22:11:02 UTC
There seems to be a subtle problem with the way subsetted Type0 fonts are built
by xdvipdfmx with XeLaTeX jobs, for the purposes of finding the /ToUnicode resource.

The main view is fine, but some basic tests fail when other aspects are checked for standards compliance.
See, e.g., the included image.

[attached image]


Firstly, the CIDSet is not built correctly: it does not include all the glyphs that are used.
pdfTeX has a similar problem with regard to CharSet.
The issue seems to be that if an accented character is built internally from multiple glyphs,
then each of those glyphs should be included in the CIDSet, as well as the combined character.
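
As a rough illustration of what a complete CIDSet would contain, here is a minimal Python sketch. It assumes CID equals glyph ID (an Identity mapping), and the `components` table and glyph numbers are hypothetical; this is not how xdvipdfmx or pdfTeX actually build the stream. The bit layout follows the PDF spec: a table of bits indexed by CID, high-order bit first, so the most significant bit of the first byte is CID 0.

```python
def glyph_closure(used, components):
    """Return the used glyph IDs plus every component glyph they reference."""
    closed = set()
    stack = list(used)
    while stack:
        gid = stack.pop()
        if gid in closed:
            continue
        closed.add(gid)
        # Composite glyphs may themselves reference composites, so keep walking.
        stack.extend(components.get(gid, ()))
    return closed


def cidset_bytes(cids):
    """Pack CIDs into a CIDSet bit table (high-order bit of byte 0 is CID 0)."""
    if not cids:
        return b""
    table = bytearray(max(cids) // 8 + 1)
    for cid in cids:
        table[cid // 8] |= 0x80 >> (cid % 8)
    return bytes(table)


# Hypothetical data: glyph 5 (an accented letter) is built from glyphs 3 and 9.
components = {5: [3, 9]}
used = {0, 5}

incomplete = cidset_bytes(used)                           # omits glyphs 3 and 9
complete = cidset_bytes(glyph_closure(used, components))  # includes them
```

With this data the incomplete table covers only CIDs 0 and 5, while the complete one also sets the bits for the two component glyphs.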

Acrobat’s Preflight has a filter to remove such incomplete CIDSets, so this isn’t a crucial deficiency.


Secondly, although clearly present, the /ToUnicode CMap resource is not being found.
The font seems to be named correctly here, according to:


page 279 of ISO 32000-1:2008

§ 9.7.6 Type 0 Font Dictionaries
§ 9.7.6.1 General
A Type 0 font dictionary contains the entries listed in Table 121.

Table 121 – Entries in a Type 0 font dictionary

BaseFont name (Required) The name of the font.
If the descendant is a Type 0 CIDFont, this name should be the concatenation of the CIDFont’s BaseFont name, a hyphen,
and the CMap name given in the Encoding entry (or the CMapName entry in the CMap).
If the descendant is a Type 2 CIDFont, this name should be the same as the CIDFont’s BaseFont name.

Since this is a Type 2 CIDFont, the last sentence (the Type 2 case) is the applicable one.
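
The naming rule quoted from Table 121 can be written out as a small check. This is only a sketch of the rule as quoted; the function name and the example font and CMap names are hypothetical.

```python
def expected_type0_basefont(cidfont_subtype, cidfont_basefont, cmap_name=None):
    """Return the BaseFont name Table 121 recommends for a Type 0 font."""
    if cidfont_subtype == "CIDFontType0":
        if cmap_name is None:
            raise ValueError("a CMap name is needed for a Type 0 CIDFont")
        # CIDFont's BaseFont, a hyphen, then the CMap name.
        return f"{cidfont_basefont}-{cmap_name}"
    if cidfont_subtype == "CIDFontType2":
        # Same as the CIDFont's BaseFont; the CMap name is not appended.
        return cidfont_basefont
    raise ValueError(f"unexpected CIDFont subtype: {cidfont_subtype}")


# Hypothetical names, for illustration only.
name = expected_type0_basefont("CIDFontType2", "ABCDEF+ExampleFont", "Identity-H")
```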

And since it is a subset of the full font, the last sentence below is also applicable.

page 285 of ISO 32000-1:2008

§ 9.8.3 Font Descriptors for CIDFonts
§ 9.8.3.1 General
In addition to the entries in Table 122, the FontDescriptor dictionaries of CIDFonts may contain the entries listed in Table 124.

Table 124 – Additional font descriptor entries for CIDFonts

CIDSet stream (Optional) A stream identifying which CIDs are present in the CIDFont file.
If this entry is present, the CIDFont shall contain only a subset of the glyphs in the character collection defined by the CIDSystemInfo dictionary.
If it is absent, the only indication of a CIDFont subset shall be the subset tag in the FontName entry (see 9.6.4, "Font Subsets").
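
For reference, the subset tag described in 9.6.4 is six uppercase letters followed by a plus sign, placed in front of the font's own name. A minimal sketch of a check for it, with a hypothetical function name and example font names:

```python
import re

# Six uppercase letters, a plus sign, then at least one character of font name.
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+.")


def has_subset_tag(font_name):
    """Return True if font_name starts with a subset tag such as 'ABCDEF+'."""
    return SUBSET_TAG.match(font_name) is not None


tagged = has_subset_tag("ABCDEF+ExampleFont")
untagged = has_subset_tag("ExampleFont")
```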


So I cannot see why the /ToUnicode resource is not being found.


Would someone with more experience building fonts and subsetting please have a look at this issue?



Cheers,

Ross


Dr Ross Moore
Mathematics Dept | 12 Wally’s Walk, 734
Macquarie University, NSW 2109, Australia
T: +61 2 9850 8955 | F: +61 2 9850 8114
M: +61 407 288 255 | E: ***@mq.edu.au

http://www.maths.mq.edu.au






CRICOS Provider Number 00002J. Think before you print.
Please consider the environment before printing this email.

This message is intended for the addressee named and may
contain confidential information. If you are not the intended
recipient, please delete it and notify the sender. Views expressed
in this message are those of the individual sender, and are not
necessarily the views of Macquarie University.
Ross Moore
2018-08-06 23:07:50 UTC
Hi all.

I think I’ve found a possible cause for this /ToUnicode problem.
It’s with the way the /CMapName is constructed within the CMap resource itself,
at least when the font's name contains spaces.

See the attached image, where the window on the left is from a PDF constructed by XeLaTeX,
while the one on the right comes from the PDF/UA Association, and is properly valid.


[attached image]


Because the space character is normally a delimiter, assigning such a value to /CMapName
is certainly invalid PostScript coding. So presumably it’s wrong in PDF too.
Surely the space needs to be encoded as #20 here?
The ‘.’ and ‘,’ look questionable, but I think these are actually OK.
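
To make the #20 idea concrete, here is a minimal sketch of the escaping rule for PDF name objects (ISO 32000-1, § 7.3.5): white space, delimiters, the number sign itself, and bytes outside the range ! to ~ are written as # plus two hex digits. The function name and the font name are hypothetical. Whether this escaping is the right remedy inside the CMap stream itself, as opposed to in a PDF name object, is a separate question that the sketch does not settle.

```python
# PDF delimiter characters; these cannot appear literally inside a name.
DELIMITERS = b"()<>[]{}/%"


def escape_pdf_name(raw):
    """Write raw as a PDF name object, using #xx for non-regular bytes."""
    out = ["/"]
    for byte in raw.encode("utf-8"):
        regular = (
            0x21 <= byte <= 0x7E      # printable, so no space or control bytes
            and byte not in DELIMITERS
            and byte != 0x23          # '#' must itself be escaped
        )
        out.append(chr(byte) if regular else f"#{byte:02X}")
    return "".join(out)


# Hypothetical font name containing spaces, a comma and a full stop.
escaped = escape_pdf_name("Example Font, Regular 1.0")
```

Under this rule only the spaces are rewritten; the comma and the full stop are regular characters and pass through unchanged.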


Changing the font to ‘Times’, the resulting PDF validates just fine.


Is it really a good idea to use the full path to the file, as the name here?
The PDF spec says it should be the name used in the file: viz.


CMapName name (Required) The name of the CMap. It shall be the same as the value of CMapName in the CMap file.




BTW, there was also an issue with Ghostscript concerning the way CMapName is constructed;
see https://bugs.ghostscript.com/show_bug.cgi?id=690114 .
There it was the // at the start of the name that was questioned.
dvipdfmx seems to be encoding the directory delimiter as a `-` now.



On 6 Aug 2018, at 8:10 am, Ross Moore <***@mq.edu.au> wrote:

There seems to be a subtle problem with the way subsetted Type0 fonts are built
by xdvipdfmx with XeLaTeX jobs, for the purposes of finding the /ToUnicode resource.


So I cannot see why the /ToUnicode resource is not being found.

This error in naming is almost certainly the reason.


Cheers,

Ross

