Overview

The DISDAR extraction service is a RESTful web service that automatically and reliably extracts semantic information from documents. The API is designed to integrate easily into existing solutions and to process large volumes of data.

DISDAR currently specializes in German invoices but will extend the service to support additional document types, more items, and other languages. To achieve this goal, DISDAR is actively developing state-of-the-art machine learning techniques that enable computers to understand documents as reliably as humans do.

Authentication

Example of an authorized request using curl

curl -H 'x-api-key: YOUR-API-KEY' \
     -H 'content-type: application/json' \
     -X POST -d '{"fileURL": "https://disdar.com/example-document.pdf"}' \
     'https://api.disdar.com/v2/extractions'

To communicate with our servers, you need to provide your API key with each request by setting the “x-api-key” header. To obtain an API key, please contact us at info@disdar.com.

x-api-key: YOUR-API-KEY
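
The same request can also be prepared from Python. The following is a minimal sketch using only the standard library. The helper name build_extraction_request is hypothetical and not part of any DISDAR client, and the request is only constructed here, not sent.

```python
import json
import urllib.request

API_URL = "https://api.disdar.com/v2/extractions"


def build_extraction_request(api_key: str, file_url: str) -> urllib.request.Request:
    """Build (but do not send) an authorized extraction request."""
    body = json.dumps({"fileURL": file_url}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "x-api-key": api_key,  # must accompany every request
            "content-type": "application/json",
        },
    )


request = build_extraction_request(
    "YOUR-API-KEY", "https://disdar.com/example-document.pdf"
)
# To actually send it: urllib.request.urlopen(request)
```

Keeping the API key in a single helper like this makes it harder to forget the header on an individual call.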

Extraction

Example of a valid extraction request

{
    "fileURL" : "https://myserver.com/static/document.pdf"
}

Example of an extraction response

{
    "id": "41ecd4f2-d5a8-11e5-b5c9-17f7e2659337",
    "confidence": 0.99,
    "extractions": [{
        "value": "14.03.2016",
        "page": 0,
        "labels": [{
            "value": "2016-03-14",
            "type": "INVOICEDATE",
            "confidence": 1.0
        }]
    }]
}

Request

POST https://api.disdar.com/v2/extractions

A POST request to this endpoint starts the analysis of the provided document. The request must contain a JSON body with the property “fileURL”, which identifies the document to be analyzed. The URL must be publicly accessible.

Supported file types are PDF, JPEG, PNG and TIFF. A file must not be larger than 50 MB and must not have more than 5 pages.
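
These limits can be checked on the client before a document is submitted. The sketch below is illustrative only: the function name preflight_problems is hypothetical, the file type is guessed from the URL's extension (the service may inspect the actual media type instead), and 1 MB is assumed to mean 1024 * 1024 bytes.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png", ".tif", ".tiff"}
MAX_SIZE_BYTES = 50 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
MAX_PAGES = 5


def preflight_problems(file_url: str, size_bytes: int, page_count: int) -> list[str]:
    """Return the reasons a document would likely be rejected (empty if none)."""
    problems = []
    # Guess the type from the URL path; this is only a heuristic.
    extension = PurePosixPath(urlparse(file_url).path).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")
    if size_bytes > MAX_SIZE_BYTES:
        problems.append("file larger than 50 MB")
    if page_count > MAX_PAGES:
        problems.append("more than 5 pages")
    return problems


print(preflight_problems("https://myserver.com/static/document.pdf", 1_000_000, 2))
```

A check like this only saves a round trip; the API's own validation remains authoritative.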

Response

The response body contains the result of a successful document analysis. If anything goes wrong, the DISDAR API returns a 4xx or 500 response code. For more information on possible errors, see Error Codes.

The result consists of a unique request ID, the overall confidence, and a list of ‘extractions’ found in your document. Each extraction contains a list of labels sorted by confidence. A label describes the ‘type’ that we have determined for that extraction. Note that extraction.value contains the extracted string exactly as it appears in your document, while extraction.labels[i].value contains a formatted representation of that value, which depends on the label type detected for that extraction (e.g. an invoice date is represented according to the ISO 8601 specification; see Supported Items for more details).
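
To illustrate how this structure might be consumed, the sketch below picks the most confident label of each extraction and indexes the results by label type. The function name best_labels and the shape of its return value are hypothetical choices. The code uses max() rather than taking the first list element, so it does not depend on the sort direction.

```python
import json

response_body = """
{
    "id": "41ecd4f2-d5a8-11e5-b5c9-17f7e2659337",
    "confidence": 0.99,
    "extractions": [{
        "value": "14.03.2016",
        "page": 0,
        "labels": [{"value": "2016-03-14", "type": "INVOICEDATE", "confidence": 1.0}]
    }]
}
"""


def best_labels(result: dict) -> dict:
    """Map each label type to its most confident formatted value and raw text."""
    best = {}
    for extraction in result.get("extractions", []):
        labels = extraction.get("labels", [])
        if not labels:
            continue
        top = max(labels, key=lambda label: label["confidence"])
        current = best.get(top["type"])
        # Keep only the most confident candidate per label type.
        if current is None or top["confidence"] > current["confidence"]:
            best[top["type"]] = {
                "value": top["value"],        # formatted representation
                "raw": extraction["value"],   # text as found in the document
                "confidence": top["confidence"],
            }
    return best


invoice_date = best_labels(json.loads(response_body))["INVOICEDATE"]
print(invoice_date["raw"], "->", invoice_date["value"])
```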

Supported Items

Examples of the JSON structure of the supported labels

{ "type": "AMOUNT",
  "value": 1190,
  "confidence": 1.0 }

{ "type": "NETAMOUNT",
  "value": 1000,
  "confidence": 1.0 }

{ "type": "TAXRATE",
  "value": 19.00,
  "confidence": 1.0 }

{ "type": "IBAN",
  "value": "DE12345678901234567890",
  "confidence": 0.977 }

{ "type": "BIC",
  "value": "COBADEFF",
  "confidence": 0.99 }

{ "type": "INVOICEDATE",
  "value": "2015-10-27",
  "confidence": 1.0 }

{ "type": "INVOICENUMBER",
  "value": "R123",
  "confidence": 1.0 }

{ "type": "CREDITORNAME",
  "value": "DISDAR GmbH",
  "confidence": 1.0 }

{ "type": "LOGO",
  "value": "DISDAR",
  "confidence": 1.0 }

{ "type": "TAXID",
  "value": "37/494/22194",
  "confidence": 1.0 }

{ "type": "VATID",
  "value": "DE291653251",
  "confidence": 1.0 }

The current version of the DISDAR extraction engine supports automatic extraction of payment information from invoices and bills. We are continuously extending the set of information we can extract.

Please contact us at info@disdar.com to propose new features that your use case requires.

Label          Description
AMOUNT         Total gross invoice amount in cents. Formatted as an integer.
NETAMOUNT      Total net amount in cents. Formatted as an integer.
TAXRATE        VAT rate applied to the net amount. Formatted as a decimal number with two decimal places.
IBAN           IBAN of the account the total amount should be transferred to. Formatted as a string without any whitespace or other separators.
BIC            BIC of the account the total amount should be transferred to. Formatted as a string without any whitespace or other separators.
INVOICEDATE    The invoice date. Formatted as a string according to the ISO 8601 specification.
INVOICENUMBER  The invoice number. Formatted as a string.
CREDITORNAME   Name of the issuer of the invoice. Formatted as a string.
LOGO           Name contained in the logo on the invoice. Formatted as a string.
TAXID          Tax ID of the creditor. Formatted as a string.
VATID          VAT ID of the creditor. Formatted as a string (uppercase letters and digits) without any whitespace.
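
Because amounts arrive as integer cents and dates as ISO 8601 strings, client code will usually convert them before use. The following is a minimal sketch of such a conversion. The function name to_python is hypothetical, and the date handling assumes the plain YYYY-MM-DD form shown in the examples.

```python
from datetime import date
from decimal import Decimal


def to_python(label: dict):
    """Convert a label's formatted value into a convenient Python type."""
    kind, value = label["type"], label["value"]
    if kind in ("AMOUNT", "NETAMOUNT"):
        return Decimal(value).scaleb(-2)  # integer cents -> currency units
    if kind == "TAXRATE":
        return Decimal(str(value))        # percentage, e.g. 19.0
    if kind == "INVOICEDATE":
        return date.fromisoformat(value)  # assumes YYYY-MM-DD
    return value                          # all other labels are plain strings


gross = to_python({"type": "AMOUNT", "value": 1190, "confidence": 1.0})
net = to_python({"type": "NETAMOUNT", "value": 1000, "confidence": 1.0})
rate = to_python({"type": "TAXRATE", "value": 19.00, "confidence": 1.0})

# Sanity check on the example values: net plus 19 % VAT equals the gross amount.
consistent = net * (1 + rate / 100) == gross
print(gross, net, rate, consistent)
```

Decimal is used instead of float so that currency arithmetic stays exact.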

Error Codes

Example of an error response JSON body

{
    "errorCode": "UNSUPPORTED_MEDIA_TYPE",
    "errorMessage": "Currently only JPG, PNG, PDF and TIF are supported."
}

The DISDAR API returns a 4xx or 500 response code if anything goes wrong. The error response body contains the two properties “errorCode” and “errorMessage”. The following error codes exist:

Status  Error Code              Description
400     BAD_REQUEST             The request payload does not have the required format (see Request).
        FILE_NOT_ACCESSIBLE     The document could not be retrieved from the URL specified in the request. Details are provided in the errorMessage.
        UNSUPPORTED_MEDIA_TYPE  The media type of the file is not supported (see Request).
        FILE_SIZE_LIMIT         The file exceeds the supported file size limit of 50 MB.
        PAGE_COUNT_LIMIT        The file exceeds the page count limit of 5 pages.
403     FORBIDDEN               You did not provide a valid API key (see Authentication).
422     UNPROCESSABLE_ENTITY    The document you provided could not be processed by our pipeline. The main reason for this error is a corrupt document (e.g. a malformed PDF file).
429     TOO_MANY_REQUESTS       You are sending too many documents at a time. We recommend sending no more than 10 documents per second, although we can handle bursts of up to 50 documents per second. If you want to raise the current limit, please contact us at info@disdar.com.
500     INTERNAL_ERROR          An internal error occurred. If possible, report the provided ID to info@disdar.com.
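
On the client side, an error response can be turned into an exception that carries both properties. The sketch below is one possible approach. The names DisdarError, parse_error and is_retryable are hypothetical, and treating only TOO_MANY_REQUESTS as worth retrying is an assumption, not documented behaviour.

```python
import json


class DisdarError(Exception):
    """Carries the HTTP status plus the errorCode and errorMessage properties."""

    def __init__(self, status: int, error_code: str, error_message: str):
        super().__init__(f"{status} {error_code}: {error_message}")
        self.status = status
        self.error_code = error_code
        self.error_message = error_message


def parse_error(status: int, body: str) -> DisdarError:
    """Build a DisdarError from an error response, tolerating malformed bodies."""
    try:
        payload = json.loads(body)
    except ValueError:
        payload = {}
    if not isinstance(payload, dict):
        payload = {}
    return DisdarError(
        status,
        payload.get("errorCode", "UNKNOWN"),
        payload.get("errorMessage", body),
    )


def is_retryable(error: DisdarError) -> bool:
    # Assumption: only rate limiting is worth retrying after a pause.
    return error.error_code == "TOO_MANY_REQUESTS"


error = parse_error(
    400,
    '{"errorCode": "UNSUPPORTED_MEDIA_TYPE",'
    ' "errorMessage": "Currently only JPG, PNG, PDF and TIF are supported."}',
)
print(error, is_retryable(error))
```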

Versioning

Example of a backwards compatible modification to the response:

{
    "id": "41ecd4f2-d5a8-11e5-b5c9-17f7e2659337",
    "additionalNewProperty": "1234",
    "confidence": 0.99,
    "extractions": [{
        "value": "abc",
        "page": 0,
        "labels": [{
            "value": "abc",
            "type": "NEW_LABEL",
            "confidence": 1.0
        }]
    }]
}

The current version of the DISDAR API is v2. The version is encoded in the URL, i.e. https://api.disdar.com/v2/. We keep this version backwards compatible by keeping the schemas for all requests and responses stable: we never change the name, data type or semantic meaning of any existing property.

Note that we consider adding new properties to the request and response objects to be a backwards compatible change, so your code should handle unknown properties gracefully.
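
One way to stay compatible with such additions is to read only the properties you need and to skip label types you do not recognize, instead of validating the response against a fixed schema. The sketch below illustrates this with the modified response shown above. The function name known_labels is hypothetical, and the set of known types simply mirrors the Supported Items list.

```python
import json

KNOWN_LABEL_TYPES = {
    "AMOUNT", "NETAMOUNT", "TAXRATE", "IBAN", "BIC", "INVOICEDATE",
    "INVOICENUMBER", "CREDITORNAME", "LOGO", "TAXID", "VATID",
}

response_body = """
{
    "id": "41ecd4f2-d5a8-11e5-b5c9-17f7e2659337",
    "additionalNewProperty": "1234",
    "confidence": 0.99,
    "extractions": [{
        "value": "abc",
        "page": 0,
        "labels": [{"value": "abc", "type": "NEW_LABEL", "confidence": 1.0}]
    }]
}
"""


def known_labels(result: dict) -> list[dict]:
    """Collect labels of known types; unknown types are skipped, not errors."""
    found = []
    for extraction in result.get("extractions", []):
        for label in extraction.get("labels", []):
            if label.get("type") in KNOWN_LABEL_TYPES:
                found.append(label)
    return found


result = json.loads(response_body)
# "additionalNewProperty" is never read, so its presence causes no failure.
print(result["id"], known_labels(result))
```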