Welcome to Sab-AI lab

A boutique AI lab in Nagoya, Japan.

PDFs are notoriously difficult to scrape. This program converts them to *.txt or *.html format. It has been tested on text in Latin alphabets and in Japanese.

This document lays out the technology's scope of work, accuracy, best use, and way forward.

...

Datasets and models download:

By downloading this source code, I acknowledge that I have fully read and understood the system's scope and description below, as well as its behaviour and acceptance test criteria, in their entirety, and that I will consider all requirements when I use or build upon the system so that it keeps performing as described.

...

Note: This program cannot open encrypted PDFs. Decrypt your PDF file before using this program.
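Before running the converter, it can help to check whether a file is likely to be encrypted. The sketch below is a rough heuristic and not part of this program: it assumes that an encrypted PDF carries an `/Encrypt` entry somewhere in its raw bytes, which is how the PDF format normally marks encryption. The function names are hypothetical, and the check can report false positives if that token happens to appear elsewhere in the file.

```python
def looks_encrypted(pdf_bytes):
    """Heuristic: True if the raw PDF bytes contain an /Encrypt entry.

    Assumption: encrypted PDFs reference an encryption dictionary through
    an /Encrypt key. This is a quick screen, not a full PDF parse.
    """
    return b"/Encrypt" in pdf_bytes


def file_looks_encrypted(pdf_path):
    """Read a PDF from disk and apply the heuristic above."""
    with open(pdf_path, "rb") as pdf_file:
        return looks_encrypted(pdf_file.read())
```

If the check returns True, decrypt the file with a tool of your choice before passing it to the converter.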

...

...

Converter-pdf-files-to-.txt-or-.html

I built this package on the work of Gorkovenko (Stanford University) and Greenfield (Harvard University) to make pdfminer.six available for Python 3.x. PDFs are notoriously difficult to scrape. Converting them to text files can make extracting their data significantly easier. There are several tools that can help you do this, but I will focus on the one that I think is the best and easiest to use: pdfminer.six.

Converting *.pdf to *.txt or *.html

I have made a standalone executable version of the package, testpdf2txt.exe. You can download and use it even if Python 3 is not installed on your machine.

This is the result of improvement work on a project called the Mysolution information extraction algorithm for unstructured datasets, with an overall accuracy of 99%.

Please download testpdf2txt.exe by clicking the link above.

You can save the program anywhere on your computer and run it by double-clicking it.
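For readers who prefer to call pdfminer.six directly from Python, the following is a minimal sketch of a PDF-to-text or PDF-to-HTML conversion using its `extract_text_to_fp` function. It is not the source of testpdf2txt.exe; the function names `output_type_for` and `convert_pdf` and the file names in the usage note are hypothetical, and the sketch assumes pdfminer.six is installed (`pip install pdfminer.six`).

```python
from pathlib import Path

# Map output file extensions to pdfminer.six output types.
OUTPUT_TYPES = {".txt": "text", ".html": "html"}


def output_type_for(out_path):
    """Return the pdfminer.six output_type for a .txt or .html path."""
    suffix = Path(out_path).suffix.lower()
    if suffix not in OUTPUT_TYPES:
        raise ValueError(f"unsupported output extension: {suffix!r}")
    return OUTPUT_TYPES[suffix]


def convert_pdf(pdf_path, out_path):
    """Convert one PDF to text or HTML, chosen by out_path's extension."""
    # Imported here so the helper above works without pdfminer.six installed.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    # Both files are opened in binary mode; the output is encoded as UTF-8.
    with open(pdf_path, "rb") as pdf_file, open(out_path, "wb") as out_file:
        extract_text_to_fp(
            pdf_file,
            out_file,
            output_type=output_type_for(out_path),
            codec="utf-8",
            laparams=LAParams(),  # enable layout analysis
        )
```

For example, `convert_pdf("report.pdf", "report.txt")` would write the extracted text, and `convert_pdf("report.pdf", "report.html")` would write an HTML rendering of the same file.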

  • Put your PDF file in a folder.
  • Double-click the program and follow the instructions on the screen.
  • To save the *.txt and *.html files in a different directory, enter the path to that directory when prompted.
  • Enter the filename of your PDF.
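The steps above can be sketched as a small prompt-driven script. This is an illustration of the flow only: the prompt wording, the function names `plan_outputs` and `main`, and the default of writing next to the PDF when no directory is given are assumptions, not the actual behaviour of testpdf2txt.exe.

```python
from pathlib import Path


def plan_outputs(pdf_name, pdf_dir=".", out_dir=""):
    """Return (pdf_path, txt_path, html_path) for one conversion.

    Assumption: if out_dir is empty, outputs go next to the PDF.
    """
    pdf_path = Path(pdf_dir) / pdf_name
    target_dir = Path(out_dir) if out_dir else pdf_path.parent
    stem = pdf_path.stem
    return pdf_path, target_dir / f"{stem}.txt", target_dir / f"{stem}.html"


def main():
    """Ask the user for the inputs listed above and show the planned paths."""
    pdf_dir = input("Folder containing the PDF: ").strip() or "."
    out_dir = input("Output directory (blank = same folder): ").strip()
    pdf_name = input("PDF filename: ").strip()
    pdf_path, txt_path, html_path = plan_outputs(pdf_name, pdf_dir, out_dir)
    print(f"Would convert {pdf_path} to {txt_path} and {html_path}")
```

Calling `main()` runs the prompts; `plan_outputs` can also be used on its own, for example `plan_outputs("report.pdf", "docs", "out")`.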

...

A quick performance report on ML

| Dataset | F1 | Accuracy | Precision | Recall (Sensitivity) |
|---|---|---|---|---|
| Non-native | 74% | 72% | 78% | 76% |
| Japanese-English speaker | 78% | 79% | 81% | 78% |
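For reference, the four metrics in the report are conventionally derived from confusion-matrix counts. The sketch below shows the standard binary definitions; the function name and the example counts are hypothetical and are not taken from this project's evaluation, whose averaging method is not stated here.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary metrics from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives
    and true negatives.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "f1": f1,
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
    }


# Hypothetical counts, for illustration only.
example = classification_metrics(tp=8, fp=2, fn=2, tn=8)
```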

...

The PDF-to-TXT-HTML source code is licensed under MIT General Public License

...

Contact us

Office

〒466-0834 Hirojichō, Umezono, Nagoya City, Aichi, Japan

sabailabo@gmail.com

Sab-AI Lab 愛知県 名古屋市 昭和区 広路町字梅園 10-4