Discussion:
[XeTeX] XeTeX/xdvipdfmx: PDF text copying with double encoded fonts
Jiang Jiang
2017-06-12 04:26:13 UTC
Hi,

There has been a report

https://github.com/CTeX-org/ctex-kit/issues/286

(in Chinese)

In summary, fonts such as Source Han Sans/Serif map both U+7ACB and
U+2F74 to the same glyph, so our reverse ToUnicode CMap construction
seems to prefer U+2F74, which is not the expected result.
(U+2F74 is KANGXI RADICAL STAND, while U+7ACB is the ordinary Chinese
character 立.)
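
To illustrate how such a collision can arise, here is a minimal Python sketch using 立 (U+7ACB) and its Kangxi radical look-alike (U+2F74) as the example pair. It assumes, hypothetically, that the reverse map is built by walking the font's cmap in code point order and keeping the first code point seen for each glyph. The glyph IDs and the function name are invented for illustration, and this is not xdvipdfmx's actual code.

```python
def build_reverse_map(cmap):
    """Invert a code point -> glyph ID table, keeping the first code point seen per glyph."""
    reverse = {}
    for codepoint in sorted(cmap):
        gid = cmap[codepoint]
        # setdefault keeps the earlier (numerically lower) code point on a collision.
        reverse.setdefault(gid, codepoint)
    return reverse


# Hypothetical cmap excerpt: two code points share one glyph (GID 1234 is invented).
cmap = {0x7ACB: 1234, 0x2F74: 1234, 0x41: 34}
reverse = build_reverse_map(cmap)
print(f"U+{reverse[1234]:04X}")  # U+2F74
```

Under this assumption the radical wins simply because it sorts first, so text copied from the PDF would come out as U+2F74 rather than U+7ACB.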

I suggested trying \XeTeXgenerateactualtext=1, but a later comment on
the issue says it didn’t work.

The problem is that on macOS I cannot reproduce it in either Preview or
Adobe Reader, with or without \XeTeXgenerateactualtext=1.

Anyone interested in reproducing it?

- Jiang



--------------------------------------------------
Subscriptions, Archive, and List information, etc.:
http:/
Akira Kakuto
2017-06-12 07:38:05 UTC
Dear Jiang,
Post by Jiang Jiang
There has been a report
https://github.com/CTeX-org/ctex-kit/issues/286
% XeTeX
\XeTeXgenerateactualtext=1
\font\1="Source Han Sans SC"
\font\2="Source Han Serif SC"
\font\3="Microsoft YaHei"

{\1 孤立子 ABC} \par
{\2 孤立子 ABC} \par
{\3 孤立子 ABC} \par

\bye

I tested the above example on Windows 7.
Used fonts are
(1) SourceHanSansSC-Regular.otf
(2) SourceHanSerifSC-Regular.otf
(3) msyh.ttf

I'm afraid that I may not understand the problem correctly.
My results:

In the case of \XeTeXgenerateactualtext=0, copy&paste
is wrong for "立" in fonts (1) and (2), in Adobe Reader DC.
In the case of \XeTeXgenerateactualtext=1, copy&paste
is OK in Adobe Reader DC for all fonts (1), (2) and (3).

Best,
Akira



Zdenek Wagner
2017-06-12 07:41:58 UTC
Post by Akira Kakuto
In the case of \XeTeXgenerateactualtext=0, copy&paste
is wrong for "立" in fonts (1) and (2), in Adobe Reader DC.
In the case of \XeTeXgenerateactualtext=1, copy&paste
is OK in Adobe Reader DC for all fonts (1), (2) and (3).
In my experience with PDF viewers on Linux, some of them ignore ActualText
completely, while others use it when saving as text but not for
copy&paste. I have not tried on Mac yet.
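
For context, ActualText is a property attached to a marked-content span in the PDF page stream, so it only helps if the viewer reads it during text extraction. The Python sketch below builds such a span by hand to show its shape. The function name and the placeholder glyph operator `<0001> Tj` are hypothetical, and the exact bytes xdvipdfmx writes may differ.

```python
def actualtext_span(text, glyph_ops):
    """Wrap content-stream operators in a /Span that carries ActualText."""
    # PDF text strings may be UTF-16BE, prefixed with the byte order mark FEFF.
    hex_text = (b"\xfe\xff" + text.encode("utf-16-be")).hex().upper()
    return f"/Span << /ActualText <{hex_text}> >> BDC\n{glyph_ops}\nEMC"


# "<0001> Tj" stands in for whatever glyph-showing operators the driver emits.
print(actualtext_span("立", "<0001> Tj"))
```

A viewer that honours ActualText returns U+7ACB for this span regardless of what the font's ToUnicode CMap says; a viewer that ignores it falls back to the CMap.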
Zdeněk Wagner
http://ttsm.icpf.cas.cz/team/wagner.shtml
http://icebearsoft.euweb.cz
Jiang Jiang
2017-06-12 13:59:05 UTC
Post by Akira Kakuto
In the case of \XeTeXgenerateactualtext=0, copy&paste
is wrong for "立" in fonts (1) and (2), in Adobe Reader DC.
In the case of \XeTeXgenerateactualtext=1, copy&paste
is OK in Adobe Reader DC for all fonts (1), (2) and (3).
Thank you for trying this; that's what I thought. Do you mind sending me
the PDF file generated with \XeTeXgenerateactualtext=1? Chances are the
originator didn't use it properly.

- Jiang
Jiang Jiang
2017-06-12 16:53:55 UTC
Hi Akira,
I have tried some PDF viewers. With \XeTeXgenerateactualtext=1,
Adobe Reader DC and Adobe Acrobat DC give the correct result, while
SumatraPDF v3.1.2, Windows Reader App (阅读器), Microsoft Edge and
Microsoft Word 2016 do not.
My OS is Windows 10 1607 (Build 14393.1198).
So there might still be value in fixing the ToUnicode map, don't you think?

I have an experimental patch at
https://github.com/jjgod/texlive/commit/f01557d549aaf27584f624fa540f6b4b05349bf3
in case you would like to build a w32tex binary for him to test.
(Actually, you can ignore the rest of the change and take only the
is_PUA_or_presentation() change.)
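
The general idea of preferring ordinary characters when several code points share a glyph can be sketched in Python as follows. The range list and both helper names are assumptions made for illustration, in particular the choice to treat Kangxi Radicals (U+2F00..U+2FD5) as low priority alongside private-use and presentation-form code points. The linked commit is the authority on what the patch actually changes.

```python
# Code point ranges treated as "less preferred" reverse-mapping targets.
LOW_PRIORITY_RANGES = [
    (0x2F00, 0x2FD5),  # Kangxi Radicals (assumed addition)
    (0xE000, 0xF8FF),  # Private Use Area
    (0xFB00, 0xFB4F),  # Alphabetic Presentation Forms
]


def is_low_priority(codepoint):
    """Return True if the code point falls in one of the less-preferred ranges."""
    return any(lo <= codepoint <= hi for lo, hi in LOW_PRIORITY_RANGES)


def build_reverse_map_with_priority(cmap):
    """Invert code point -> glyph ID, replacing a low-priority entry when a better one appears."""
    reverse = {}
    for codepoint, gid in cmap.items():
        current = reverse.get(gid)
        if current is None or (is_low_priority(current) and not is_low_priority(codepoint)):
            reverse[gid] = codepoint
    return reverse


# Hypothetical cmap excerpt: GID 1234 is invented for the example.
cmap = {0x2F74: 1234, 0x7ACB: 1234}
print(f"U+{build_reverse_map_with_priority(cmap)[1234]:04X}")  # U+7ACB
```

With a rule like this the result no longer depends on the order in which the two code points are encountered.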
Akira Kakuto
2017-06-12 22:31:20 UTC
Hi Jiang,
Post by Jiang Jiang
So there might still be value in fixing the ToUnicode map, don't you think?
I have an experimental patch at
https://github.com/jjgod/texlive/commit/f01557d549aaf27584f624fa540f6b4b05349bf3
in case you would like to build a w32tex binary for him to test.
I confirmed that after applying your patch, the ToUnicode map becomes
correct, and copy&paste works for the example with
\XeTeXgenerateactualtext=0.
Interested users on 32-bit Windows can replace bin/win32/dvipdfmx.dll
with the binary downloaded from:

http://www.w32tex.org/toolsw32/dvipdfmx.dll

Please commit your patch after Karl starts 2018/dev.

Best,
Akira



Jiang Jiang
2017-06-12 22:36:09 UTC
Thank you for building this. I'm not familiar with w32tex; can users
replace the same file in their TeX Live 2017 installation?
Akira Kakuto
2017-06-12 22:44:31 UTC
Hi Jiang,
Post by Jiang Jiang
Thank you for building this. I'm not familiar with w32tex; can users
replace the same file in their TeX Live 2017 installation?
Yes, I think that dvipdfmx.dll in TeX Live 2017 can be replaced by
the new dvipdfmx.dll.

Best,
Akira




Akira Kakuto
2017-06-12 08:33:17 UTC
% luatex-plain test.tex
% luatex-plain is in the ConTeXt package
\font\1="SourceHanSansSC-Regular.otf"
\font\2="SourceHanSerifSC-Regular.otf"
\font\3="msyh.ttf"

{\1 孤立子 ABC} \par
{\2 孤立子 ABC} \par
{\3 孤立子 ABC} \par

\bye

With "luatex-plain" from the ConTeXt package,
copy&paste was OK in Adobe Reader DC for all
fonts (1), (2), and (3).

Best,
Akira


