It seems that the upload of input files and the download of output files are not performed by Streamflow itself, so we agreed with Mike that we should implement those steps ourselves.
The idea is the following:
Each user will have a dedicated PVC.
Within each PVC, sub-directories will be structured as follows:
job_id/execution_id/input
job_id/execution_id/output
Output files should be made available in S3 buckets to facilitate downloading.
Working on the implementation of the input file upload. I've managed to programmatically list the clusters, namespaces, PVCs, and jobs, and to access the PVCs using temporary pods. The upload itself is the next step. For now, I'm working with the existing job-data PVC for the implementation.
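As a reference for that discovery step, here is a minimal sketch using the official kubernetes Python client; it assumes a kubeconfig with access to the target cluster, and the namespace, PVC, pod and container names are placeholders:

```python
from kubernetes import client, config

config.load_kube_config()                                # or load_incluster_config() when running in-cluster
contexts, active = config.list_kube_config_contexts()    # the "clusters" reachable from the kubeconfig
v1 = client.CoreV1Api()
batch = client.BatchV1Api()

# List namespaces, PVCs and jobs (namespace name is a placeholder).
namespaces = [ns.metadata.name for ns in v1.list_namespace().items]
pvcs = v1.list_namespaced_persistent_volume_claim("my-namespace").items
jobs = batch.list_namespaced_job("my-namespace").items

# Temporary helper pod that mounts the job-data PVC so files can be copied in and out.
helper_pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="pvc-helper"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(
            name="helper",
            image="busybox",
            command=["sleep", "3600"],
            volume_mounts=[client.V1VolumeMount(name="data", mount_path="/mnt/data")],
        )],
        volumes=[client.V1Volume(
            name="data",
            persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(claim_name="job-data"),
        )],
    ),
)
v1.create_namespaced_pod(namespace="my-namespace", body=helper_pod)
```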
The file upload process is complete. It currently uses static values to upload a file into the existing job-data PVC, placing it under the directory structure user_id/job_id/execution_id/input.
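For illustration, a hedged sketch of that upload path, assuming the helper pod above is running with the PVC mounted at /mnt/data and that kubectl is available where the upload runs; the pod, container and namespace names are placeholders:

```python
import os
import subprocess

def upload_to_pvc(local_path: str, user_id: str, job_id: str, execution_id: str,
                  pod: str = "pvc-helper", namespace: str = "my-namespace") -> None:
    """Copy a local file into the PVC mounted at /mnt/data inside the helper pod."""
    dest_dir = f"/mnt/data/{user_id}/{job_id}/{execution_id}/input"
    # Ensure the target directory exists inside the PVC.
    subprocess.run(["kubectl", "exec", "-n", namespace, pod, "--", "mkdir", "-p", dest_dir],
                   check=True)
    # kubectl cp tunnels a tar stream through the pod into the mounted volume.
    subprocess.run(["kubectl", "cp", local_path,
                    f"{namespace}/{pod}:{dest_dir}/{os.path.basename(local_path)}",
                    "-c", "helper"],
                   check=True)
```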
The implementation of downloading a file from the data proxy directly into the appropriate folder in the PVC has been completed. It is significantly faster than uploading the file from a local PC: for the same 150 MB file, the direct download took 7 seconds, while the upload took 130 seconds.
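A rough sketch of that direct download, again going through the helper pod so the data never touches the local machine; it assumes the pod image ships wget (busybox does), and the names and paths are placeholders:

```python
import subprocess

def download_into_pvc(url: str, user_id: str, job_id: str, execution_id: str, filename: str,
                      pod: str = "pvc-helper", namespace: str = "my-namespace") -> None:
    """Fetch a file from the data proxy straight into the PVC, skipping the local machine."""
    dest = f"/mnt/data/{user_id}/{job_id}/{execution_id}/input/{filename}"
    subprocess.run(
        ["kubectl", "exec", "-n", namespace, pod, "--",
         "sh", "-c", f"mkdir -p $(dirname {dest}) && wget -q -O {dest} '{url}'"],
        check=True,
    )
```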
I have completed the upload of the output file to the Data-Proxy. However, this functionality cannot yet be tested due to authentication issues with the Data-Proxy.
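Until the authentication question is settled the exact call cannot be verified, but conceptually it is an authenticated HTTP upload. A purely hypothetical sketch with requests, where the upload URL, the two-step flow and the token source are all assumptions:

```python
import requests

def upload_output_to_data_proxy(local_path: str, upload_url: str, token: str) -> None:
    """Hypothetical sketch: PUT the output file to an upload URL issued by the Data-Proxy.

    The endpoint shape and whether a pre-signed URL is required are assumptions; the
    bearer token must come from the (still unresolved) authentication flow.
    """
    with open(local_path, "rb") as fh:
        resp = requests.put(upload_url, data=fh,
                            headers={"Authorization": f"Bearer {token}"})
    resp.raise_for_status()
```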
I have completed the capability to download the output file locally, which works as expected.
I have completed the dynamic generation of the kustomization.yml and env files needed for the execution.
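As an illustration of what this generation step produces, a minimal sketch; the fields and naming scheme below are placeholders rather than the final template:

```python
import os
import yaml

def write_execution_files(out_dir: str, job_id: str, execution_id: str, image: str) -> None:
    """Render a minimal kustomization.yml and env file for one execution (hypothetical fields)."""
    kustomization = {
        "apiVersion": "kustomize.config.k8s.io/v1beta1",
        "kind": "Kustomization",
        "resources": ["job.yaml"],                    # assumed base resource
        "namePrefix": f"{job_id}-{execution_id}-",    # assumed naming scheme
    }
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "kustomization.yml"), "w") as fh:
        yaml.safe_dump(kustomization, fh, sort_keys=False)
    with open(os.path.join(out_dir, "env"), "w") as fh:
        fh.write(f"JOB_ID={job_id}\nEXECUTION_ID={execution_id}\nIMAGE={image}\n")
```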
I have run the PSD end-to-end with the assumptions mentioned below.
WIP:
Developing the logic for the dynamic creation of the streamflow.yml file, which is the actual executable definition and must be generated based on the structure of the CWL files involved.
Current Assumptions:
Input and output files must exist directly under the root/job_id/ directory.
If input files are placed under job_id/execution_id/, the CWL workflow fails to locate them. This will be investigated at a later stage.
Key Comments:
The input.yml file must contain specific variables required for generating the kustomization.yml and env files manually. Clear instructions will need to be provided to users once this setup is finalized.
The main CWL file must be named work.cwl to be correctly recognized and executed.
@alexakis have you had the chance to check the way to run the workflow via Streamflow with the assumptions provided by @vgeo? We need to further sync about next steps, revisit the points about the architecture (https://gitlab.ebrains.eu/ri/tech-hub/devops/base/-/issues/266) and the provision of an API for the Dashboard's needs.
Will you have time to check these by next week (roughly 10 or 11 July), so that we can then arrange a sync meeting with more team members for further actions?
I pushed the latest version of the code to the main branch. Compared to the previous commit, I have added the functionality for uploading workflow output files to the data proxy, cleaned up the code, and updated the README file to make it fully informative. The use of symbolic links is still pending, as it hasn't worked yet, but I pushed the code since this is not blocking.
I have implemented the symbolic link mechanism to avoid re-downloading input files.
The process works as follows:
It computes the hash of the URL, downloads the file to the path mnt/cache/hash_, and creates a symbolic link at mnt/job_id/input/.
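A condensed sketch of that mechanism, assuming the PV is mounted under mnt/, that the cache file name is the hash_ prefix followed by the digest, and that SHA-1 is the hash in use (all assumptions); the actual implementation lives in the separate branch:

```python
import hashlib
import os
import urllib.request

def fetch_with_cache(url: str, job_id: str, filename: str, root: str = "mnt") -> str:
    """Download `url` once into the shared cache and expose it to the job via a symlink."""
    digest = hashlib.sha1(url.encode()).hexdigest()      # cache key (hash algorithm is an assumption)
    cache_path = os.path.join(root, "cache", f"hash_{digest}")
    link_path = os.path.join(root, job_id, "input", filename)

    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    os.makedirs(os.path.dirname(link_path), exist_ok=True)

    if not os.path.exists(cache_path):                   # reuse an earlier download if present
        urllib.request.urlretrieve(url, cache_path)

    if not os.path.lexists(link_path):
        # Relative link, so it stays valid wherever the PV is mounted.
        os.symlink(os.path.relpath(cache_path, start=os.path.dirname(link_path)), link_path)
    return link_path
```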
I've pushed this implementation to a separate branch with the function commented out: although the mechanism works, Streamflow is currently unable to read the file through the symbolic link. I tried a few things to resolve this, including adding the following volume mount:
  - name: cache
    mountPath: /mnt/cache
    readOnly: true
but it had no effect.
Next, I plan to work on creating CWL steps that handle the download and upload processes directly.
For links to work in a predictable manner, they should point to files inside the same PV.
Let's assume we have chosen to represent jobs with UUIDs and execution IDs with serial numbers:
jobId=266aa49b-458b-497c-ac11-91acd56e4d6e   # a UUID
executionId=1                                # a serial number (unique in the context of a job)
Then, we can follow a simple hierarchy like:
[PV-root]
  downloads
    1b8fd2b82dbaa7ffe0facb5d47afaa4465b2c436   # http://example.net/something
    054485a597ef0d05a28dcaf96c6c049bf1f922bc   # http://example.net/foo
    ...                                        # other downloads
  jobs
    266aa49b-458b-497c-ac11-91acd56e4d6e
      1
        inputs
          something -> ../../../../downloads/1b8fd2b82dbaa7ffe0facb5d47afaa4465b2c436   # relative (hard or symbolic) link
          ...                                  # other inputs inside same execution
      ...                                      # other executions
    ...                                        # other jobs
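To make the link layout concrete, a small sketch that derives the relative link target for the hierarchy above; the function name is hypothetical and only the path arithmetic matters:

```python
import os

def link_input(pv_root: str, job_id: str, execution_id: int, input_name: str, digest: str) -> None:
    """Create jobs/<job_id>/<execution_id>/inputs/<input_name> as a relative link into downloads/<digest>."""
    target = os.path.join(pv_root, "downloads", digest)
    link = os.path.join(pv_root, "jobs", job_id, str(execution_id), "inputs", input_name)
    os.makedirs(os.path.dirname(link), exist_ok=True)
    # For the example above this yields ../../../../downloads/<digest>.
    os.symlink(os.path.relpath(target, start=os.path.dirname(link)), link)

link_input("/mnt/pv", "266aa49b-458b-497c-ac11-91acd56e4d6e", 1, "something",
           "1b8fd2b82dbaa7ffe0facb5d47afaa4465b2c436")
```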