It seems that the upload of input files and the download of output files are not performed by Streamflow itself, so we agreed with Mike that we should implement those steps ourselves.
The idea is the following:
Each user will have a dedicated PVC.
Within each PVC, sub-directories will be structured as follows:
job_id/execution_id/input
job_id/execution_id/output
Output files should be made available in S3 buckets to facilitate downloading.
Working on the implementation of the input file upload. I've managed to programmatically list the clusters, namespaces, PVCs, and jobs, and to access the PVCs using temporary pods. The upload itself is the next step. For now, I'm working with the existing job-data PVC for the implementation.
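As a reference for that discovery step, here is a minimal sketch using the official kubernetes Python client; it assumes a kubeconfig with access to the target cluster, and the namespace, PVC, pod and container names are placeholders:

```python
from kubernetes import client, config

config.load_kube_config()                                # or load_incluster_config() when running in-cluster
contexts, active = config.list_kube_config_contexts()    # the "clusters" reachable from the kubeconfig
v1 = client.CoreV1Api()
batch = client.BatchV1Api()

# List namespaces, PVCs and jobs (namespace name is a placeholder).
namespaces = [ns.metadata.name for ns in v1.list_namespace().items]
pvcs = v1.list_namespaced_persistent_volume_claim("my-namespace").items
jobs = batch.list_namespaced_job("my-namespace").items

# Temporary helper pod that mounts the job-data PVC so files can be copied in and out.
helper_pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="pvc-helper"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(
            name="helper",
            image="busybox",
            command=["sleep", "3600"],
            volume_mounts=[client.V1VolumeMount(name="data", mount_path="/mnt/data")],
        )],
        volumes=[client.V1Volume(
            name="data",
            persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(claim_name="job-data"),
        )],
    ),
)
v1.create_namespaced_pod(namespace="my-namespace", body=helper_pod)
```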
The file upload process is complete. It currently uses static values to upload a file into the existing job-data PVC, placing it under the directory structure user_id/job_id/execution_id/input.
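For illustration, a hedged sketch of that upload path, assuming the helper pod above is running with the PVC mounted at /mnt/data and that kubectl is available where the upload runs; the pod, container and namespace names are placeholders:

```python
import os
import subprocess

def upload_to_pvc(local_path: str, user_id: str, job_id: str, execution_id: str,
                  pod: str = "pvc-helper", namespace: str = "my-namespace") -> None:
    """Copy a local file into the PVC mounted at /mnt/data inside the helper pod."""
    dest_dir = f"/mnt/data/{user_id}/{job_id}/{execution_id}/input"
    # Ensure the target directory exists inside the PVC.
    subprocess.run(["kubectl", "exec", "-n", namespace, pod, "--", "mkdir", "-p", dest_dir],
                   check=True)
    # kubectl cp tunnels a tar stream through the pod into the mounted volume.
    subprocess.run(["kubectl", "cp", local_path,
                    f"{namespace}/{pod}:{dest_dir}/{os.path.basename(local_path)}",
                    "-c", "helper"],
                   check=True)
```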
The implementation of downloading a file from the data proxy directly into the appropriate folder in the PVC has been completed. It is significantly faster than uploading the file from a local PC: for the same 150 MB file, the direct download took 7 seconds, while the upload took 130 seconds.
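A rough sketch of that direct download, again going through the helper pod so the data never touches the local machine; it assumes the pod image ships wget (busybox does), and the names and paths are placeholders:

```python
import subprocess

def download_into_pvc(url: str, user_id: str, job_id: str, execution_id: str, filename: str,
                      pod: str = "pvc-helper", namespace: str = "my-namespace") -> None:
    """Fetch a file from the data proxy straight into the PVC, skipping the local machine."""
    dest = f"/mnt/data/{user_id}/{job_id}/{execution_id}/input/{filename}"
    subprocess.run(
        ["kubectl", "exec", "-n", namespace, pod, "--",
         "sh", "-c", f"mkdir -p $(dirname {dest}) && wget -q -O {dest} '{url}'"],
        check=True,
    )
```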
I have completed the upload of the output file to the Data-Proxy. However, this functionality cannot yet be tested due to authentication issues with the Data-Proxy.
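Until the authentication question is settled the exact call cannot be verified, but conceptually it is an authenticated HTTP upload. A purely hypothetical sketch with requests, where the upload URL, the two-step flow and the token source are all assumptions:

```python
import requests

def upload_output_to_data_proxy(local_path: str, upload_url: str, token: str) -> None:
    """Hypothetical sketch: PUT the output file to an upload URL issued by the Data-Proxy.

    The endpoint shape and whether a pre-signed URL is required are assumptions; the
    bearer token must come from the (still unresolved) authentication flow.
    """
    with open(local_path, "rb") as fh:
        resp = requests.put(upload_url, data=fh,
                            headers={"Authorization": f"Bearer {token}"})
    resp.raise_for_status()
```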
I have completed the capability to download the output file locally, which works as expected.
I have completed the dynamic generation of the kustomization.yml and env files needed for the execution.
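As an illustration of what this generation step produces, a minimal sketch; the fields and naming scheme below are placeholders rather than the final template:

```python
import os
import yaml

def write_execution_files(out_dir: str, job_id: str, execution_id: str, image: str) -> None:
    """Render a minimal kustomization.yml and env file for one execution (hypothetical fields)."""
    kustomization = {
        "apiVersion": "kustomize.config.k8s.io/v1beta1",
        "kind": "Kustomization",
        "resources": ["job.yaml"],                    # assumed base resource
        "namePrefix": f"{job_id}-{execution_id}-",    # assumed naming scheme
    }
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "kustomization.yml"), "w") as fh:
        yaml.safe_dump(kustomization, fh, sort_keys=False)
    with open(os.path.join(out_dir, "env"), "w") as fh:
        fh.write(f"JOB_ID={job_id}\nEXECUTION_ID={execution_id}\nIMAGE={image}\n")
```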
I have run the PSD end-to-end with the assumptions mentioned below.
WIP:
Developing the logic for the dynamic creation of the streamflow.yml file, which is the actual executable definition and must be generated based on the structure of the CWL files involved.
Current Assumptions:
Input and output files must exist directly under the root/job_id/ directory.
If input files are placed under job_id/execution_id/, the CWL workflow fails to locate them. This will be investigated at a later stage.
Key Comments:
The input.yml file must contain specific variables required for generating the kustomization.yml and env files manually. Clear instructions will need to be provided to users once this setup is finalized.
The main CWL file must be named work.cwl to be correctly recognized and executed.
@alexakis have you had the chance to check the way to run the workflow via Streamflow with the assumptions provided by @vgeo? We need to further sync about next steps, revisit the points about the architecture (https://gitlab.ebrains.eu/ri/tech-hub/devops/base/-/issues/266) and the provision of an API for the Dashboard's needs.
Will you have time to check these by next week (roughly 10 or 11 July), so that we can then arrange a sync meeting with more team members for further actions?
I pushed the latest version of the code to the main branch. Compared to the previous commit, I have added the functionality for uploading workflow output files to the data proxy, cleaned up the code, and updated the README file to make it fully informative. The use of symbolic links is still pending, as it hasn't worked yet, but I pushed the code since this is not blocking.
I have implemented the symbolic link mechanism to avoid re-downloading input files.
The process works as follows:
It computes the hash of the URL, downloads the file to the path mnt/cache/hash_, and creates a symbolic link at mnt/job_id/input/.
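A condensed sketch of that mechanism, assuming the PV is mounted under mnt/, that the cache file name is the hash_ prefix followed by the digest, and that SHA-1 is the hash in use (all assumptions); the actual implementation lives in the separate branch:

```python
import hashlib
import os
import urllib.request

def fetch_with_cache(url: str, job_id: str, filename: str, root: str = "mnt") -> str:
    """Download `url` once into the shared cache and expose it to the job via a symlink."""
    digest = hashlib.sha1(url.encode()).hexdigest()      # cache key (hash algorithm is an assumption)
    cache_path = os.path.join(root, "cache", f"hash_{digest}")
    link_path = os.path.join(root, job_id, "input", filename)

    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    os.makedirs(os.path.dirname(link_path), exist_ok=True)

    if not os.path.exists(cache_path):                   # reuse an earlier download if present
        urllib.request.urlretrieve(url, cache_path)

    if not os.path.lexists(link_path):
        # Relative link, so it stays valid wherever the PV is mounted.
        os.symlink(os.path.relpath(cache_path, start=os.path.dirname(link_path)), link_path)
    return link_path
```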
I've pushed this implementation to a separate branch with the function commented out: although the mechanism works, Streamflow is currently unable to read the file through the symbolic link. I tried a few things to resolve this, including adding the following volume mount:
  - name: cache
    mountPath: /mnt/cache
    readOnly: true
but it had no effect.
Next, I plan to work on creating CWL steps that handle the download and upload processes directly.
For links to work in a predictable manner, they should point to files inside the same PV.
Let's assume we have chosen to represent jobs with UUIDs and execution IDs with serial numbers:
jobId=266aa49b-458b-497c-ac11-91acd56e4d6e   # a UUID
executionId=1                                # a serial number (unique in the context of a job)
Then, we can follow a simple hierarchy like:
[PV-root]
  downloads
    1b8fd2b82dbaa7ffe0facb5d47afaa4465b2c436   # http://example.net/something
    054485a597ef0d05a28dcaf96c6c049bf1f922bc   # http://example.net/foo
    ...                                        # other downloads
  jobs
    266aa49b-458b-497c-ac11-91acd56e4d6e
      1
        inputs
          something -> ../../../../downloads/1b8fd2b82dbaa7ffe0facb5d47afaa4465b2c436   # relative (hard or symbolic) link
          ...                                  # other inputs inside same execution
      ...                                      # other executions
    ...                                        # other jobs
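To make the link layout concrete, a small sketch that derives the relative link target for the hierarchy above; the function name is hypothetical and only the path arithmetic matters:

```python
import os

def link_input(pv_root: str, job_id: str, execution_id: int, input_name: str, digest: str) -> None:
    """Create jobs/<job_id>/<execution_id>/inputs/<input_name> as a relative link into downloads/<digest>."""
    target = os.path.join(pv_root, "downloads", digest)
    link = os.path.join(pv_root, "jobs", job_id, str(execution_id), "inputs", input_name)
    os.makedirs(os.path.dirname(link), exist_ok=True)
    # For the example above this yields ../../../../downloads/<digest>.
    os.symlink(os.path.relpath(target, start=os.path.dirname(link)), link)

link_input("/mnt/pv", "266aa49b-458b-497c-ac11-91acd56e4d6e", 1, "something",
           "1b8fd2b82dbaa7ffe0facb5d47afaa4465b2c436")
```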