Upload large files in Integrated Data Lake

The Integrated Data Lake provides an API to upload files through the Insights Hub Gateway. Currently, the API does not allow uploading files larger than 100 MB. This limit hinders the transfer and storage of large data files, degrades the user experience, and restricts the range of data types and user requirements the system can address.

A multipart upload allows an application to upload a large file in smaller parts. After all the parts are uploaded, they are combined into the original larger file. The advantages of uploading a large file in smaller parts are:

  • Improves the performance by uploading many smaller parts in parallel.
  • Recovers from a network error more quickly by restarting the upload for the failed parts.

Note

Once a multipart request is initiated for a file:
- A file with the same name cannot be uploaded (neither as a single file upload nor through another multipart request).
- Metadata operations (add or update) on the file are blocked until the multipart request is completed.
- A file with the same name cannot be deleted from the storage.

To upload a large file in Integrated Data Lake, follow these steps:

  1. Initiate the multipart upload and obtain an "uploadId" via the POST API call.

    • The multipart request is valid for 24 hours, and all parts of the file must be uploaded within that time. After 24 hours, the multipart request expires and the upload must be started again.
    • An "uploadId" is returned to the user. All parts must be uploaded under this "uploadId", which is valid for 24 hours only.
    • The response also contains a "timeout", which indicates when the multipart request expires (24 hours after it is initiated).

    Endpoint:

    POST /multipart
    

    Request example:

    {
        "metadataKeyValues": [
          {
            "key": "country_of_origin",
            "metadataCollectionId": "document_review",
            "values": [
              "IN"
            ]
          }
        ]
    }
    

    Response example:

    {
        "path": "data/ten=punint01/sp165/bigfile.log",
        "uploadId": "5e36d55270074d65acb1393463c30a86",
        "timeout": "2024-01-19 09:01:45",
        "status": "IN_PROGRESS"
    }
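
The "timeout" value can be used to check whether a multipart request is still valid before uploading further parts. The following is a minimal sketch, assuming the timestamp format shown in the response example and assuming the value is expressed in UTC (the time zone is not stated here). `is_expired` is a hypothetical helper, not part of the API.

```python
from datetime import datetime, timezone
from typing import Optional

# Format of the "timeout" field as it appears in the response example.
TIMEOUT_FORMAT = "%Y-%m-%d %H:%M:%S"


def is_expired(initiate_response: dict, now: Optional[datetime] = None) -> bool:
    """Return True if the multipart request's "timeout" has passed."""
    # Assumption: the timestamp is in UTC; the documentation does not say.
    deadline = datetime.strptime(
        initiate_response["timeout"], TIMEOUT_FORMAT
    ).replace(tzinfo=timezone.utc)
    current = now if now is not None else datetime.now(timezone.utc)
    return current >= deadline


response = {
    "path": "data/ten=punint01/sp165/bigfile.log",
    "uploadId": "5e36d55270074d65acb1393463c30a86",
    "timeout": "2024-01-19 09:01:45",
    "status": "IN_PROGRESS",
}
# One hour before the deadline, the request is still usable.
still_valid = not is_expired(
    response, now=datetime(2024, 1, 19, 8, 1, 45, tzinfo=timezone.utc)
)
```

Passing `now` explicitly keeps the check deterministic; in normal use it can be omitted so that the current time is used.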
    
  2. Divide the large file into multiple parts and upload the parts in parallel via the PUT API call.

    • Once the multipart request is initiated, the user can upload the parts under the "uploadId" returned in the response of the initiate upload request.
    • Each part must be at least 50 MB and at most 100 MB in size.
    • A maximum of 10,000 parts can be uploaded in a single multipart request.
    • Each successfully uploaded part returns a "partCommitId" in the response.

    Endpoint:

    PUT /multipart/{uploadId}
    

    Request example:

    {
        "metadata": {
          "metadataKeyValues": [
            {
              "key": "country_of_origin",
              "metadataCollectionId": "document_review",
              "values": [
                "IN"
              ]
            }
          ]
        },
        "content": "string"
    }
    

    Response example:

    {
        "name": "test1.csv",
        "path": "/folder1/test1/test1.csv",
        "location": "ten=punvs1/folder1/test1/test1.csv",
        "lastModified": "2018-10-03T09:21:36.559Z",
        "size": 25,
        "metadataKeyValues": [
          {
            "key": "country_of_origin",
            "metadataCollectionId": "document_review",
            "values": [
              "IN"
            ],
            "isPropagated": true
          }
        ]
    }
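
The part size and part count limits above determine how a file can be split. The following is a minimal sketch of a planner that computes the byte range of each part. It assumes that "MB" means 2**20 bytes, that part numbers start at 1 (as in the sample responses), and that the last part may be smaller than 50 MB; none of these points is confirmed here. `plan_parts` is a hypothetical helper.

```python
from typing import List, Tuple

MB = 1024 * 1024          # assumption: 1 MB = 2**20 bytes
MIN_PART_SIZE = 50 * MB   # minimum part size stated above
MAX_PART_SIZE = 100 * MB  # maximum part size stated above
MAX_PARTS = 10_000        # maximum number of parts per multipart request


def plan_parts(file_size: int, part_size: int = MAX_PART_SIZE) -> List[Tuple[int, int, int]]:
    """Return (part_number, offset, length) for every part of the file."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    if not MIN_PART_SIZE <= part_size <= MAX_PART_SIZE:
        raise ValueError("part_size must be between 50 MB and 100 MB")
    count = (file_size + part_size - 1) // part_size  # integer ceiling
    if count > MAX_PARTS:
        raise ValueError("file needs more than 10,000 parts")
    parts = []
    for index in range(count):
        offset = index * part_size
        # Assumption: the final part is allowed to be shorter than 50 MB.
        length = min(part_size, file_size - offset)
        parts.append((index + 1, offset, length))
    return parts


# A 250 MB file split into 100 MB parts gives three parts.
plan = plan_parts(250 * MB)
```

Each tuple can be handed to a separate worker that reads `length` bytes at `offset` and uploads them, which is what makes parallel upload and per-part retry possible.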
    
    • List multipart file upload requests: This GET method lists all the multipart requests that have been initiated, along with their status.

    Endpoint:

    GET /multipart
    

    Sample Response:

    {
        "multipartRequests": [
          {
            "path": "/floor1/machine/logs.txt",
            "uploadId": "string",
            "status": "COMPLETED"
          }
        ]
    }
    
    • List uploaded parts within a multipart request: This GET method returns the status of the upload request for the given "uploadId", along with the parts for which the upload is completed.

    Endpoint:

    GET /multipart/{uploadId}
    

    Sample Response:

    {
        "path": "/floor1/machine/logs.txt",
        "status": "COMPLETED",
        "parts": [
          {
            "partNumber": 1,
            "partCommitId": "string"
          }
        ]
    }
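
The list of uploaded parts can be used to find out which parts still need to be uploaded, for example after a network error. The following is a minimal sketch that works on an already parsed response of the shape shown above. `missing_parts` is a hypothetical helper, and treating a part as uploaded only when it carries a "partCommitId" is an assumption.

```python
from typing import List


def missing_parts(list_response: dict, expected_count: int) -> List[int]:
    """Return the part numbers (1-based) that are not yet reported as uploaded."""
    uploaded = {
        part["partNumber"]
        for part in list_response.get("parts", [])
        # Assumption: a part counts as uploaded only if it has a commit ID.
        if part.get("partCommitId")
    }
    return [n for n in range(1, expected_count + 1) if n not in uploaded]


listing = {
    "path": "/floor1/machine/logs.txt",
    "status": "IN_PROGRESS",
    "parts": [
        {"partNumber": 1, "partCommitId": "commit-a"},  # illustrative IDs
        {"partNumber": 3, "partCommitId": "commit-c"},
    ],
}
to_retry = missing_parts(listing, expected_count=4)
```

Only the part numbers in `to_retry` need to be uploaded again; the parts already listed can be left as they are.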
    
  3. Complete the upload by calling the POST API to mark the upload as complete.

    • Once all the parts are uploaded, the user can call the complete multipart API to close the multipart request.
    • The user provides the list of parts that combine into a single file, each with its "partCommitId".
    • Only the parts listed in the complete request are considered when combining the parts.

    Note

    Since only 10,000 parts are allowed, the maximum size of a file that can be uploaded is limited to approximately 800 to 900 GB.

    Endpoint:

    POST /multipart/{uploadId}/complete
    

    Sample Request:

    {
        "parts": [
          {
            "partNumber": 1,
            "partCommitId": "string"
          }
        ]
    }
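
The complete request body can be assembled from the "partCommitId" values collected while uploading the parts. The following is a minimal sketch, assuming the commit IDs were stored in a dictionary keyed by part number. `build_complete_body` is a hypothetical helper, and listing the parts in ascending part number order is an assumption, since the required order is not stated here.

```python
import json
from typing import Dict


def build_complete_body(commit_ids: Dict[int, str]) -> str:
    """Serialize the body for POST /multipart/{uploadId}/complete."""
    if not commit_ids:
        raise ValueError("at least one uploaded part is required")
    parts = [
        {"partNumber": number, "partCommitId": commit_ids[number]}
        # Assumption: parts are listed in ascending part number order.
        for number in sorted(commit_ids)
    ]
    return json.dumps({"parts": parts})


# Commit IDs as they might be collected from parallel uploads (illustrative values).
body = build_complete_body({2: "commit-b", 1: "commit-a"})
```

Sorting by part number makes the body independent of the order in which parallel uploads happened to finish.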
    

    Sample Response:

    {
        "name": "test1.csv",
        "path": "/folder1/test1/test1.csv",
        "location": "ten=punvs1/folder1/test1/test1.csv",
        "lastModified": "2018-10-03T09:21:36.559Z",
        "size": 25,
        "metadataKeyValues": [
          {
            "key": "country_of_origin",
            "metadataCollectionId": "document_review",
            "values": [
              "IN"
            ],
            "isPropagated": true
          }
        ]
    }
    
  4. If required, abort the upload to terminate the multipart file upload process.

    Endpoint:

    DELETE /multipart/{uploadId}
    

    Note

    A multipart request cannot be terminated after it has expired.
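
An abort call can be prepared with the Python standard library as sketched below. The request is only constructed, not sent. The base URL is a placeholder, and bearer token authentication is an assumption; neither is specified here. `build_abort_request` is a hypothetical helper.

```python
import urllib.request
from urllib.parse import quote


def build_abort_request(base_url: str, upload_id: str, token: str) -> urllib.request.Request:
    """Prepare DELETE /multipart/{uploadId} without sending it."""
    url = f"{base_url.rstrip('/')}/multipart/{quote(upload_id, safe='')}"
    return urllib.request.Request(
        url,
        method="DELETE",
        # Assumption: the gateway accepts a bearer token in this header.
        headers={"Authorization": f"Bearer {token}"},
    )


request = build_abort_request(
    "https://gateway.example.invalid/datalake",  # placeholder base URL
    "5e36d55270074d65acb1393463c30a86",
    "my-token",
)
```

To send the prepared request, pass it to `urllib.request.urlopen(request)`. Because an expired multipart request cannot be terminated, the abort should be issued before the "timeout" is reached.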