PDF Reader - highlighted text contains no spaces with some PDFs
I have just noticed that in the beta PDF Reader *for some PDFs* highlighted text is shown in the annotations panel without spaces - i.e. all text is run together.
When I open those same PDFs in other PDF readers, highlighted text is shown correctly, with spaces.
I cannot see any differences between PDFs that the PDF Reader handles correctly and those that it doesn't - has anyone else noticed this, or can help with diagnosis?
Zotero 5.0.97-beta.33+fdcd4e51c on Windows 10
If those PDFs are scanned, newer and more advanced OCR software could help by replacing the text layers.
"plagueofbedbugsandfamilyillness.Attheotherendofthecountry,acountessispaintingbotanical"
and PDF-XChange Editor giving me:
"plague of bed bugs and family illness.At the other end of the country, a countess is painting botanical"
As far as I can see, the two programs must be interpreting what they pull out of the text layer differently.
i would like to renew the interest in this issue, which remains with the 6.0.11 version of Zotero (i am on mac with OS Monterey 12.2.1).
i found a PDF whose text when imported into an editor shows only 'stringed' words (a group of words make 1 single string), while when the same text is imported into a word processor shows both spaced and stringed words.
when i paste the text into Sublime text, it appears that the stringed words are separated by 1 <0x2029> character, while the spaced words are separated by 2 characters: <0x2029> and a usual space.
apparently in UTF-16 <0x2029> represents Unicode Character 'PARAGRAPH SEPARATOR'
could the developers handle the conversion of <0x2029> into a space, for the needs of the users?
best
Maurizio
:-)
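For anyone who wants to check a pasted excerpt the same way without a programmer's editor, here is a minimal Python sketch. It is only an illustration, not anything Zotero itself runs: the function name `describe_separators` and the sample string are made up for this example. It lists every Unicode separator character in a string together with its code point and official name, so a <0x2029> shows up as PARAGRAPH SEPARATOR.

```python
import unicodedata


def describe_separators(text):
    """Return (index, code point, Unicode name) for every separator character.

    Hypothetical helper for diagnosis only.
    """
    found = []
    for i, ch in enumerate(text):
        # Unicode categories starting with "Z" are separators:
        # Zs (space), Zl (line separator), Zp (paragraph separator).
        if unicodedata.category(ch).startswith("Z"):
            name = unicodedata.name(ch, "UNKNOWN")
            found.append((i, f"U+{ord(ch):04X}", name))
    return found


# Made-up sample: one "stringed" boundary (U+2029 only) and
# one "spaced" boundary (U+2029 followed by an ordinary space).
sample = "plague\u2029of\u2029 bed"
for entry in describe_separators(sample):
    print(entry)
```

If the excerpt only reports U+2029 between words, with no U+0020 next to it, that matches the 'stringed' case described above.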
and the trouble of words not separated when importing from annotations into notes happens also with other PDFs
it happens because at the end of every word there is a UTF-16 <0x2029> which represents Unicode Character 'PARAGRAPH SEPARATOR'
and preview properly treats every word as a 'finished' paragraph.
the minimum is to change <0x2029> into spaces when one selects an annotation for extraction as note. this way a true 'PARAGRAPH SEPARATOR' is lost, but it is not terrible given that one generates excerpts of text
Yes, we can probably treat paragraph separator as a space, but usually we try to avoid fixing other PDF exporters' random bugs.
in fact i saw that when this happens, i can paste the flawed text extraction into a programmer's editor, search for <0x2029> and replace it with space.
not terrible.
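The same search-and-replace can be scripted. This is a rough sketch of the workaround only, assuming the extracted text has been copied into a Python string; `separators_to_spaces` is a hypothetical name, and this is not how Zotero handles it internally. It turns every run of <0x2029> and ordinary spaces into one space, so the 'spaced' words (which already have <0x2029> plus a space) do not end up with double spaces.

```python
import re

# Matches one or more consecutive PARAGRAPH SEPARATOR (U+2029)
# or ordinary space characters.
_SEPARATOR_RUN = re.compile("[\u2029 ]+")


def separators_to_spaces(text):
    """Replace each run of U+2029 / spaces with a single space.

    Hypothetical helper: genuine paragraph breaks encoded as U+2029
    are flattened to spaces as well.
    """
    return _SEPARATOR_RUN.sub(" ", text).strip()


# Made-up sample mixing "stringed" and "spaced" word boundaries.
print(separators_to_spaces("plague\u2029of\u2029 bed\u2029bugs"))
```

As noted earlier in the thread, a true paragraph separator is lost this way, which seems acceptable for short excerpts.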
i understood that this problem already appears when you draw the pointer to select an area of text: if the text is left and right justified but the highlight of the selection is narrower than the visual margins of the text, then the text contains <0x2029> characters