Discussion:
[XeTeX] xelatex to doc?
Hueckstedt, Robert A. (rah2k)
2018-01-30 17:57:10 UTC
With a publisher's permission I used xelatex to provide them copy, not camera-ready copy, for a long book that has Sanskrit in Devanagari and an English translation. Of course, the files I provided the publisher are pdfs. Now, the publisher wants them in doc. When they try to cut and paste from the pdf to doc, none of the conjunct consonants are recognized in the doc file. I used the velthuis-sanskrit mapping, and I am wondering if using the RomDev mapping would make a difference. I somehow doubt it. Suggestions?

Gratefully,
Bob Hueckstedt
Zdenek Wagner
2018-01-30 21:02:22 UTC
Hi,

This is not a matter of which TECkit mapping is used for input but of
\XeTeXgenerateactualtext; setting it should help.


Zdeněk Wagner
http://ttsm.icpf.cas.cz/team/wagner.shtml
http://icebearsoft.euweb.cz
--------------------------------------------------
http://tug.org/mailman/listinfo/xetex
Michal Hoftich
2018-01-30 21:29:00 UTC
Hi Bob,

You can try compiling your TeX file to HTML using tex4ht. The HTML
can then be pasted into Word. The basic command for compiling a XeTeX
file to HTML is

make4ht -ux filename.tex

The development version of make4ht can also compile to the ODT format,
which Word can open directly:

make4ht -ux -f odt filename.tex

You may need some additional configuration for the document to compile
correctly, depending on the packages and custom macros it uses.

Best regards,
Michal



Ross Moore
2018-01-31 00:49:36 UTC
Hi Robert and Michal,

Michal's suggestion of compiling with make4ht might work, but first I’d
try using Acrobat Pro to save the PDF directly as a Word document.

This *can* work really well, especially when the PDF is enriched with
some tagging and the correct ToUnicode CMap resources for the fonts.
Try it and see if the result is reasonable.

Alternatively, you can Export to HTML from Acrobat Pro; though I’d
expect that if the .doc export is no good, then the HTML export would
suffer from similar issues.

It may even be that Adobe Reader can do these exports now,
as it is the same code-base.




Hope this helps.

Ross


PS. If you don’t have access to Acrobat Pro to try this,
can you send me a few pages? I’ll then try it for you.
If the result is good, that may be sufficient reason for you
to consider investing in a license.


Dr Ross Moore
Mathematics Dept | 12 Wally’s Walk, 734
Macquarie University, NSW 2109, Australia
Zdenek Wagner
2018-01-31 10:18:32 UTC
No, neither make4ht nor Acrobat Pro helps. The reason is that the generated PDF
contains glyphs, whereas copy and paste, save as text, save as Word, and save as
anything else all require characters, and this information is lost. Moreover, the
order of glyphs does not follow the order of characters in the word. It can
only be solved by asking XeTeX to add /ActualText as a Unicode string to
each word and, of course, by using a PDF viewer that does not ignore it. Adobe
Reader is sufficient. The extracted string then contains Unicode characters,
and the conversion to glyphs is done by the MS Word shaping engine (Uniscribe).

Note: consider the well-known name Siddhartha, where the dependent vowel (matra) I
is visually at the very beginning and R is visually at the very end.


Zdeněk Wagner
http://ttsm.icpf.cas.cz/team/wagner.shtml
http://icebearsoft.euweb.cz
Hueckstedt, Robert A. (rah2k)
2018-02-05 20:54:27 UTC
Thank you all for your help with this matter. The answer, as many have already pointed out, is simply to add the line

\XeTeXgenerateactualtext=1

to the preamble of the header file. Then, as the publisher has assured me, all the Sanskrit in Devanagari can be copied out of the PDF file and pasted into doc, etc., to one’s heart’s content.
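
For anyone finding this thread later, here is a minimal sketch of where that line can go in a complete document. The font name and the sample word are assumptions chosen for illustration; the book itself used the velthuis-sanskrit input mapping rather than direct Unicode input.

```latex
% Minimal sketch; compile with xelatex.
% Assumption: the font "Noto Serif Devanagari" is installed.
\documentclass{article}
\usepackage{fontspec}
\newfontfamily\devfont[Script=Devanagari]{Noto Serif Devanagari}

% Ask XeTeX to attach /ActualText (the original Unicode string)
% to each word it writes to the PDF.
\XeTeXgenerateactualtext=1

\begin{document}
{\devfont सिद्धार्थ} (Siddhartha)
\end{document}
```

To check the result, copy the Devanagari word out of the PDF in a viewer that honours /ActualText (Adobe Reader, per Zdenek's message) and paste it into Word.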

Gratefully,
Bob Hueckstedt

ShreeDevi Kumar
2018-01-31 04:02:08 UTC
As suggested earlier by Zdenek Wagner, use \XeTeXgenerateactualtext in
the preamble of your LaTeX document. Then the Devanagari text can also be
easily copied from the generated PDF. We use this option in PDFs posted on
sanskritdocuments.org and it works very well.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

ShreeDevi Kumar
2018-01-31 04:35:10 UTC
Correct syntax is

\XeTeXgenerateactualtext=1

ShreeDevi