07.18
12

Apache Tika 1.2

by admin ·

A new version of the Apache Tika library has been released. Tika extracts text, attachments, and metadata from files in a variety of common formats. The library currently supports dozens of formats, including Microsoft Office documents (OLE and OOXML), OpenDocument, PDF, and others.

Major changes in this version:

  • Added a server mode that allows access to Tika through a simple REST API (HTTP)
  • Major improvements in support for Apple iWork documents
  • A new library for detecting the language and encoding of text, based on the Mozilla.org detection algorithm
  • Support for decompressing XZ and Pack200
  • The command-line utility can now be given a password for decrypting encrypted documents
  • Many bug fixes, including a problem with extracting non-OLE attachments from office documents
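
As a rough illustration of the new server mode, here is a minimal Python sketch of a client that sends a document over HTTP and reads back plain text. The announcement does not give the endpoint details, so the port (9998), the `/tika` path, the use of PUT, and the UTF-8 response are assumptions based on how later tika-server builds behave; check them against the server you actually run. The helper names `build_extract_request` and `extract_text` are hypothetical.

```python
import urllib.request

# Assumed values, not taken from the announcement: later tika-server
# builds listen on port 9998 and accept documents via PUT /tika.
TIKA_URL = "http://localhost:9998/tika"


def build_extract_request(document: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT request asking the server for plain text."""
    return urllib.request.Request(
        url,
        data=document,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(document: bytes, url: str = TIKA_URL) -> str:
    """Send the document to a running server and return the extracted text."""
    request = build_extract_request(document, url)
    with urllib.request.urlopen(request) as response:
        # Assumes the server answers in UTF-8.
        return response.read().decode("utf-8")
```

With a server running locally, a call such as `extract_text(open("report.pdf", "rb").read())` would return the text of the file, where `report.pdf` stands for any document in a supported format.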
