AWS PDF to text

3/12/2023

GOAL: I want to increase the number of documents we can run through Textract at once, because being capped at 15 is a joke and won't work.

Here are the relevant pieces of my current script:

    def __init__(self, list_of_records, database_connection, pdf_number, document_type):
        self.database_connection = database_connection
        cursor = self.database_connection.cursor()
        cursor.execute("""Select rds_uid from property_information where (cfn is not null) and (cfn != 'Unavailable') and (textract is null)""")  # Make sure not unavailable
        cursor.execute("""Select rds_uid from liens where (textract is null)""")  # Make sure not unavailable

        textract_client = boto3.client("textract")
        for num, rds_uid in enumerate(self.records):
            full_text = self.scan_pdf(rds_uid, textract_client)
        print(f"done ...")

I'm wondering whether it makes more sense to split the pages into separate images before I use Textract to OCR them, or whether there are efficiencies in cost, speed, rate limiting, or something else to be gained by not splitting the pages until after OCR. I get errors when I run detect_document_text() on my multi-page TIFs, but for some reason I am able to run it on at least some single-page TIFs. So I'm also wondering whether there are efficiencies between detect_document_text() and start_document_text_detection(). We are a small nonprofit, but we are designing a repeatable workflow for other datasets of this size or larger. At the scale of 1 million+ documents, the cost and time add up.
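One likely reason for the errors: as far as the Textract documentation describes it, the synchronous detect_document_text() call handles single-page input only, while multi-page TIFs and PDFs go through the asynchronous start_document_text_detection() call, which reads the file from S3. Below is a minimal sketch of that asynchronous path, not a definitive implementation. The function name ocr_multipage, the polling interval, and the bucket and key arguments are assumptions for illustration; the client is passed in so the function can be tried with a real boto3.client("textract") or with a stub.

```python
import time


def ocr_multipage(textract_client, bucket, key, poll_seconds=5.0, max_polls=120):
    """Run asynchronous text detection on one S3 object.

    Returns a dict mapping page number to the text of that page's LINE blocks.
    `bucket` and `key` are placeholders for wherever the TIF or PDF is stored.
    """
    job = textract_client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state (simple polling;
    # a production pipeline might use the SNS notification option instead).
    for _ in range(max_polls):
        result = textract_client.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    if result["JobStatus"] not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError(f"Textract job {job_id} ended as {result['JobStatus']}")

    # Results are paginated; follow NextToken and group LINE text by page.
    pages = {}
    while True:
        for block in result.get("Blocks", []):
            if block["BlockType"] == "LINE":
                pages.setdefault(block.get("Page", 1), []).append(block["Text"])
        token = result.get("NextToken")
        if not token:
            break
        result = textract_client.get_document_text_detection(
            JobId=job_id, NextToken=token
        )

    return {page: "\n".join(lines) for page, lines in pages.items()}
```

Because each block in the asynchronous result carries a Page number, the text can be separated per page after OCR, so splitting the TIFs beforehand is not required just to get page-level text. Whether it is cheaper or faster depends on pricing and account quotas, which this sketch does not address.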

I have a pile of 1 million+ TIF files (single- and multi-page) that I need to OCR and search for particular terms, so that by the end of my workflow I have images and text for each individual page. I have built a Step Machine with different Lambdas that handle different parts of the pipeline. After the Step Machine has run on all of our raw images, I use Python to collect the results and load pointers to the processed images, text, and JSON into a database.
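The final collection step could look roughly like the sketch below, assuming the per-page text is already available as a dict of page number to text. Everything named here is hypothetical: the ocr_pages table, its columns, the key layout under "processed/", and the use of sqlite3 as a stand-in for whatever database the pipeline actually loads into.

```python
import sqlite3


def load_page_pointers(conn, doc_id, pages, search_terms, prefix="processed"):
    """Store one row per page with pointers to its image, text, and JSON.

    `pages` maps page number -> OCR text. The key layout and table schema
    are illustrative assumptions, not the original pipeline's.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS ocr_pages (
               doc_id TEXT,
               page INTEGER,
               image_key TEXT,
               text_key TEXT,
               json_key TEXT,
               matched_terms TEXT,
               PRIMARY KEY (doc_id, page)
           )"""
    )
    rows = []
    for page, text in sorted(pages.items()):
        lowered = text.lower()
        # Case-insensitive substring search for each term of interest.
        hits = [term for term in search_terms if term.lower() in lowered]
        base = f"{prefix}/{doc_id}/page-{page:04d}"
        rows.append(
            (doc_id, page, f"{base}.png", f"{base}.txt", f"{base}.json", ",".join(hits))
        )
    # INSERT OR REPLACE keeps the load idempotent if a document is reprocessed.
    conn.executemany("INSERT OR REPLACE INTO ocr_pages VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

Storing pointers rather than the text itself keeps the table small at the 1 million+ document scale; the matched_terms column is one simple way to record which pages contain the search terms.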
