For extracting text from a PDF file, my favorite tool is pdftotext. Using the -layout option, you basically get plain text back, which is relatively easy to manipulate using Python. The answer's example script is described by its docstring: '''Extract text from PDF files. It requires pdftotext from the poppler utilities.

Yes, correct, this was exactly my question, but with some changes, so I am quoting it again: take MATHS, take every page from every PDF where it occurs, and put those pages in a new Maths PDF.

"MATHS" will be present as a unique keyword in "file1.pdf", "GEOMETRY" will be present as a unique keyword in "file2.pdf", and so on. My search words will be ("MATHS", "GEOMETRY", and so on).

Now I need a new "Maths.pdf" output document that contains only the pages with "MATHS" keyword related content. (Please note: "file1.pdf" may also contain a search keyword like "GEOMETRY", which is not of interest in this case. The new "Maths.pdf" should hold only pages with "MATHS" keyword related table contents; pages containing "GEOMETRY" should be skipped and should not appear in "Maths.pdf".)

In the same way, I need a new "Geometry.pdf" output document that contains only the pages with "GEOMETRY" keyword related table information. (Similarly, pages containing "MATHS" should be skipped and should not appear in "Geometry.pdf".)

I also need to find and match two or three keywords in a PDF page and extract that page from the PDF document. Please kindly help me solve this.

Keyword_list =

For example: PDF page 1 will contain Profit and Loss, so using the two keywords 'Profit' and 'Loss' I need to extract page 1. PDF page 2 will contain Income, Expense and Savings, so using the three keywords 'Income', 'Expense' and 'Savings' I need to extract page 2. In this way I have a bag-of-words pattern for each page, and based on that pattern I need to extract pages.
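The pdftotext approach and the per-page keyword sets described above could be combined roughly as follows. This is only a sketch under assumptions: it assumes the `pdftotext` binary from poppler is on the PATH, relies on pdftotext ending each page with a form feed character, and uses plain case-insensitive substring matching. The function names (`pdf_page_texts`, `split_pages`, `pages_with_all_keywords`) are hypothetical and are not taken from the original thread.

```python
import subprocess


def split_pages(text):
    """Split pdftotext output into pages; pdftotext ends each page with a form feed."""
    pages = text.split("\f")
    # The final form feed leaves an empty trailing chunk; drop it.
    if pages and not pages[-1].strip():
        pages.pop()
    return pages


def pdf_page_texts(pdf_path):
    """Run `pdftotext -layout` (assumed to be installed) and return one string per page."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" sends the text to stdout
        check=True,
        capture_output=True,
    )
    return split_pages(result.stdout.decode("utf-8", errors="replace"))


def pages_with_all_keywords(pages, keywords):
    """Return 1-based numbers of the pages that contain every keyword.

    Matching is a case-insensitive substring test, so 'Loss' would also
    match 'Glossary'; a stricter word-boundary match may be preferable.
    """
    wanted = [k.lower() for k in keywords]
    return [
        number
        for number, page in enumerate(pages, start=1)
        if all(k in page.lower() for k in wanted)
    ]
```

With a real file this might be called as `pages_with_all_keywords(pdf_page_texts("file1.pdf"), ["Profit", "Loss"])`, where the file name is only a placeholder taken from the question.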
My objective is to search for only one particular string, for example "MATHS", "CALCULATIONS", "GEOMETRY", "ANALYTICAL" or "VALUES", in many huge PDF documents of more than 300 pages each, and after finding that string, to extract only the page that contains it.

For example, in one PDF document a page may contain "MATHS" as the search string; using that string, the matching pages should be extracted from the document. In the same way, in another PDF document a page may contain "GEOMETRY" as the search string, and that particular page should be extracted using it. The search for either "MATHS" or "CALCULATIONS" or "GEOMETRY" or "ANALYTICAL" should work across multiple PDF files and extract the pages containing these strings, without duplicating pages.

The code fragment from the thread builds a regular-expression alternation from the keyword list, loops over the page indices, and reports each hit. The step that loads each page and tests it for a match is not shown:

search_item = "MATHS, CALCULATIONS, GEOMETRY, ANALYTICAL, VALUES".replace(", ", "|")
for this_page in range(len(pdf_document)):
    print("%s found on page_no %i" % (search_item, this_page))
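One way to fill in the search-and-extract step is sketched below. It is not the code from the thread: it assumes the third-party `pypdf` package for reading and writing pages (the thread's own `pdf_document` object comes from a library that is not named), and the helper names `group_pages_by_keyword`, `page_texts` and `write_pages` are hypothetical. Each page index is recorded at most once per keyword, which addresses the "without duplication" requirement.

```python
import re

KEYWORDS = ["MATHS", "CALCULATIONS", "GEOMETRY", "ANALYTICAL", "VALUES"]


def group_pages_by_keyword(texts, keywords=KEYWORDS):
    """Map each keyword to the 0-based page indices whose text contains it.

    Matching is case-sensitive and whole-word, so 'mathematics' does not
    count as a hit for 'MATHS'.
    """
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, keywords)) + r")\b")
    groups = {keyword: [] for keyword in keywords}
    for page_no, text in enumerate(texts):
        # set() keeps a page from being listed twice for the same keyword.
        for keyword in set(pattern.findall(text)):
            groups[keyword].append(page_no)
    return groups


def page_texts(pdf_path):
    """Extract the text of every page (assumes pypdf is installed)."""
    from pypdf import PdfReader

    return [page.extract_text() or "" for page in PdfReader(pdf_path).pages]


def write_pages(pdf_path, page_numbers, out_path):
    """Copy the given 0-based pages of pdf_path into a new PDF at out_path."""
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(pdf_path)
    writer = PdfWriter()
    for page_no in page_numbers:
        writer.add_page(reader.pages[page_no])
    with open(out_path, "wb") as handle:
        writer.write(handle)
```

A call such as `write_pages("file1.pdf", group_pages_by_keyword(page_texts("file1.pdf"))["MATHS"], "Maths.pdf")` would then produce the per-keyword output file. Note that a page mentioning both "MATHS" and "GEOMETRY" lands in both groups here; excluding such pages, as the question asks, would need an extra filter.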