Zotero, OCR plugin and OCR file not indexed.

Zotero beta 114, 64-bit, Zotero OCR 0.7.3 plugin.
360-page image PDF.
OCR done.
The output PDF and the other output files (images, txt, hocr) are present next to the source image PDF but are not visible in Zotero.
How do I index these files, and why do they not appear next to the input file in Zotero?
  • [After using the plugin] Zotero only knows about the PDF and some HTML! The images, hocr and others are not intended to be indexed (someone might want to check them for testing purposes, or use them outside of Zotero).
    The expected result within Zotero is a PDF with a text layer that can be fully indexed. Did this work for you?
  • No. After OCR, Zotero still displays only the source image PDF.
    Outside of Zotero, in Total Commander, I see in the same folder as the source image file: .zotero-reader-state, the OCRed PDF, hocr, txt, image-list.txt, page1.png to ... page 360.png and the source image PDF.
    In this situation I don't see any way to index these files except copying everything out of the folder except the source PDF and dragging the output PDF back in to index it. Is this enough to complete the action?

    https://s3.amazonaws.com/zotero.org/images/forums/u587761/kkox4o0nbtrp4yll89lz.png
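
For anyone who wants to repeat this kind of check without a file manager, here is a minimal Python sketch that counts the files in one item folder by extension, so leftover OCR output (PNGs, hocr, txt) is easy to spot. The function name `summarize_folder` is made up for this illustration; it is not part of Zotero or the OCR plugin, and it only reads the folder.

```python
from collections import Counter
from pathlib import Path


def summarize_folder(folder):
    """Count the regular files in `folder`, grouped by lowercase extension.

    Files without an extension (such as .zotero-reader-state) are counted
    under their full name. Hypothetical helper, not a Zotero API.
    """
    counts = Counter()
    for entry in Path(folder).iterdir():
        if entry.is_file():
            key = entry.suffix.lower() or entry.name
            counts[key] += 1
    return dict(counts)
```

Call it with the path of the item folder you want to inspect, for example `summarize_folder("path/to/item/folder")`, and compare the counts with what Zotero shows for that item.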
  • Thanks for the screenshot! Something isn't normal: the OCRed PDF should be attached to the Zotero record.

    Is the original image PDF a regular attachment, or a linked file?
    Is the record saved in your personal library or in a group library?

    You can of course manually add the output PDF, it should solve the immediate problem. I'd be very happy if we could get to the root cause - in case others encounter it as well.
  • The original PDF is a regular attachment.
    Personal library, single user, multiple profiles.
    Source PDF 141 MB, output OCRed PDF 451 MB.
  • It really looks fine, except for this missing attachment problem. It's a larger file than the ones I've tried before, but I don't know if that could have any impact.

    Is the output PDF OK? Are all pages present and OCRed?
    Did the plugin create a note, as selected in the preferences?

    It would be great if you can:
    - run the plugin on a fresh copy of your Zotero record (i.e. with only the source PDF), but this time selecting yes to saving html/hocr and no to the PNGs;
    - check whether the html files are attached to the record (they should), and whether the PNGs are left behind in the folder (they should not).

    Finally, if necessary, would you agree to share the source PDF with me?
  • Yes, the output PDF is all OCRed.
    No note is created by the plugin.
    I will check with a smaller PDF as you described to view the results.
    Thanks.
  • It seems that something bad is happening just after the OCR is completed - the note creation is the next step in the code.

    How large is the txt file that was produced by the OCR?
  • OK, that's not huge but maybe longer than Zotero likes.
    Another test that would be useful: disable "Save output to a note" for your 360-page PDF. If things then work correctly, we'll have a winner!
  • I used the same Zotero 7 portable and downloaded an 8-page test PDF from the web.
    I put it in a test folder. After OCR, no file other than the source file is seen in the test folder. This time, outside of the folders in the Zotero library, I can see a note with the extracted text, the output OCR PDF and 5 HTML files with 5 OCRed pages.
    I checked the folder outside of Zotero in Total Commander: 8 image files were created and not added to Zotero. If the images are not attached and only the output HTML files are, then there should be 8 HTML files, not 5.
    Other observation: the OCR process window closes after OCR.
    Imagine if I OCR a 3000-page book PDF and it creates (if it works as with small PDFs) 3000 HTML files outside the folder where I put the source PDF, uncategorized in the Zotero library.
  • Please run the tests I have proposed with the 360-page PDF, that's where the real problem is. We can discuss the other aspects later, but for now it only adds complexity to the issue :-)
  • I did the "selecting yes to saving html/hocr and no to the PNGs" test.
    Restarted Zotero.
    Ran the OCR.
    Same results. PNGs created. Exactly the same symptoms. It is a public PDF, you can check and test it: https://archive.org/details/Psb22
    https://s3.amazonaws.com/zotero.org/images/forums/u587761/ioc5x5zzcsre4ih30dw3.png
  • I disabled saving the output as a note, but the results are the same.
  • Thanks for the link to the document, that was very useful.
    On my machine, everything has worked as expected - the note is created, but cannot be synchronized with the Zotero server because it is too long. So it is probably not the best idea to create them for long documents in general, but your tests indicate that this is not the original cause I suspected.

    Now about your 8-page document:
    1) do I understand correctly that the images, html and other files were created by the plugin in a different folder than the one containing the source PDF? That would be new, I haven't read that in your first messages - did I miss something?

    2) number of html files: it is controlled by the plugin settings, 5 is the correct number here as per your last screenshot. You will only get a html file for each page if you set the preference to more than the number of pages in your document.

    3) closing the OCR process window (I assume you mean tesseract): I think it is normal on MS Windows. There will be no window at all on macOS, Linux, etc.

    4) You don't need to generate any html file if you don't want to, or just a few if you prefer, so a 3000-page document doesn't automatically mean 3000 html files. Now if they are generated in an unexpected folder, that's still a problem - see question 1.
  • Hi. Thanks for the detailed answer.
    I use Zotero only offline, without connecting to the Zotero server.
    8409 items, tons of PDF attachments, 361 GB total, sqlite 232 MB.

    1. The 8-page document: the number of HTML pages to create is set to 5; this was probably the reason it created 5 and not 8. I did not test it yesterday. Yes, the plugin always creates files in the main list of files, uncategorized, and not where the source file is.

    2. Yes, probably as you say. I did not test it.

    3. Yes, I considered it normal. I just mentioned that it closes and finishes the OCR process even though the problem remains the same.

    4. Any file generated after OCRing small PDFs is put in the main uncategorized list, not in the source PDF's folder.
    Even when I enable it, no HTML files are generated for the big PDF, and PNG files are created even if I choose not to create them.
  • 1. That's an important piece of information, thank you! If the files are not created in the same folder as the source PDF, I guess the plugin will not report them properly to Zotero. This could be the actual cause of the problem. Can you tell me the full paths to the source PDF and to the output files for one of your tests?

    4. Intermediate PNG files will always be created, the OCR is performed on them. But if the "Save the intermediate PNGs..." setting is not selected, they will be deleted at the end of the process. In your case, the plugin fails before that step so this clean-up doesn't happen.

    I have been testing only with a regular Zotero installation, not a portable version - I will try that next.
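
The clean-up behaviour described in point 4 (intermediate files are only removed at the very end, so an earlier failure leaves them behind) can be illustrated with a small Python sketch. This is not the plugin's code: the plugin is not written in Python, and `ocr_with_cleanup`, `later_steps` and `keep_pngs` are hypothetical names chosen for the example. It shows one common way to make the clean-up run even when a later step fails, using `try`/`finally`.

```python
from pathlib import Path


def ocr_with_cleanup(work_dir, page_count, later_steps, keep_pngs=False):
    """Create placeholder page images, run `later_steps`, then clean up.

    Hypothetical illustration only; it does not reproduce the real plugin.
    """
    work_dir = Path(work_dir)
    pngs = []
    try:
        for n in range(1, page_count + 1):
            png = work_dir / f"page{n}.png"
            png.write_bytes(b"")  # stand-in for a rendered page image
            pngs.append(png)
        later_steps()  # e.g. attaching the output to the item; may raise
    finally:
        # Runs whether or not later_steps() raised an exception.
        if not keep_pngs:
            for png in pngs:
                png.unlink(missing_ok=True)
```

With this structure the PNGs are removed even if `later_steps` raises, while `keep_pngs=True` corresponds to choosing to keep the intermediate images.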
  • 1. "Physically", the files appear in the same folder (e.g. R34ZQQ5P, etc.) that Zotero generates for a single item, as seen in Total Commander, in both cases: small source PDFs, whose generated files are visible in Zotero in the main uncategorized list, and large source PDFs, whose OCRed files do not appear in Zotero at all.
    4. "Save the intermediate PNGs..." is unselected and the PNG files are not deleted. Yes, you are right.
  • 1. Oh, I had misunderstood your point, thanks for the explanation. I will look into this.
  • I haven't asked before, and maybe it's a silly question, but does the source PDF have a parent item (i.e. right-hand pane with title, author, document type, etc.), or is it stored as an isolated file in your Zotero library?
  • An isolated file without a parent item (I think because it is an image PDF), without any indexing, in a folder named test.
  • Thanks for the quick response! That's the main problem, then: the plugin needs a parent item to work properly (like many standard Zotero functionalities). Right-click and select "Create Parent Item" as per
    https://www.zotero.org/support/adding_items_to_zotero#pdfs

    There is indeed no metadata that Zotero can recognize in your image PDF, that's why the parent was not created automatically. This means that you'll need to add at least some minimal information manually. Another possibility would be to find the book in a library catalog, import the metadata from there and attach the PDF to that record.

    I will make a note about this for a future version of the plugin. Maybe it would be a good idea to check for such a case, and create some kind of parent if there is none.

    [Edited to add] I'm not saying that creating a parent item is the final solution, but it will at least keep all output elements together. It will then be easier to verify if something doesn't work as it should.
  • You solved the problem. I created the parent item for the image PDF and launched the OCR plugin. It now added the created OCRed PDF and the 5 HTML files set by default. So it works now. Thank you for your time and patience.
  • No problem - thanks for helping me to understand the issue!
  • Hi. Interesting, the output OCRed PDFs are very large. E.g. 37 MB input PDF, 1.2 GB output OCRed PDF. I also have a 1.2 GB PDF that I have not OCRed yet, which at this rate could produce a 40 GB output PDF, maybe.

    The HTML generation is not enabled. Still, the plugin produces 5 HTML files each time. I select and delete them, and they are sent to the trash. I delete them from there too, but the HTML files still remain in the folder where the input and output PDFs are (I checked with Total Commander). This is related to Zotero, not the plugin. How can I correct it?

    https://s3.amazonaws.com/zotero.org/images/forums/u587761/bzzvb8e3180zit3n7e7k.png

    If Zotero does not delete files physically, it is a big problem over time, making the data folder extremely large quickly.
  • The size of the output PDF is a known issue: https://github.com/UB-Mannheim/zotero-ocr/issues/42. We'll probably be able to improve that eventually.

    HTML files created while the option is not selected: I'll check.

    Deleting attachments vs. deleting files: there is actually something we can do in the plugin to improve that (the files would be deleted when you empty the Zotero trash). It is part of some new code that is under review at the moment - it is supposed to address a different problem but I see it would also have a positive impact here :-)
  • Thanks. I have a 1578-page, 871 MB scanned PDF; I will OCR it to see the output size.
    ...
    It took more than 5 hours to generate the output, which is 3.73 GB in size.
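
To put the sizes reported in this thread side by side, here is a short Python calculation of how many times larger each output PDF is than its input. It uses only the three input/output pairs mentioned above and assumes 1 GB = 1000 MB for the conversion; `growth_factor` is just an illustrative helper name.

```python
def growth_factor(input_mb, output_mb):
    """Return how many times larger the output is than the input."""
    return output_mb / input_mb


# (input MB, output MB) as reported in this thread; 1 GB is taken as 1000 MB.
reported = {
    "360-page PDF": (141, 451),
    "37 MB PDF": (37, 1200),
    "1578-page PDF": (871, 3730),
}

factors = {name: growth_factor(i, o) for name, (i, o) in reported.items()}
for name, factor in factors.items():
    print(f"{name}: output is about {factor:.1f}x the input")
```

The factor ranges from roughly 3x to over 30x across these three files, so the result for one PDF is a poor predictor for another; an estimate such as 40 GB for the 1.2 GB file is a worst case based on the 37 MB example rather than a likely outcome.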