# Profiling and Validating a CSV dataset

``` shell
Usage: qctool csv <options> <csv file> <schema json>
... | @@ -27,11 +27,11 @@ Options: |
The report file will be saved in the folder where the incoming dataset file is located.

## Data Cleaning

After reviewing the **Data Validation** report created in the previous step (see the **Data Validation Report** wiki section for further details), we can proceed with the data cleaning operation. The cleaned dataset file is saved in the same folder as the incoming dataset, under the original dataset name with the suffix `_corrected` added.
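As a rough illustration of the naming rule above, the following shell sketch derives the path where the cleaned file would be expected. The helper name `corrected_path` is hypothetical (it is not part of qctool), and placing the suffix before the file extension is an assumption; the text only states that `_corrected` is added to the original name.

```shell
# Hypothetical helper, not part of qctool: print the expected location of the
# cleaned dataset for a given input file.
# Assumption: the '_corrected' suffix goes before the file extension.
corrected_path() (
  dir=$(dirname -- "$1")
  base=$(basename -- "$1")
  case "$base" in
    # Name has an extension: insert the suffix before it.
    ?*.*)
      printf '%s/%s_corrected.%s\n' "$dir" "${base%.*}" "${base##*.}" ;;
    # No extension: append the suffix to the whole name.
    *)
      printf '%s/%s_corrected\n' "$dir" "$base" ;;
  esac
)
```

Under these assumptions, `corrected_path /data/study/patients.csv` prints `/data/study/patients_corrected.csv`, which can then be passed to `ls` or any other command to check that the cleaning step produced its output.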
# Inference of a CSV dataset's schema

```shell
Usage: qctool infercsv <options> <csv file>
... | @@ -60,6 +60,8 @@ Options: |
--help Show this message and exit.
```

## Supported schema formats

The schema can be saved in two formats:

1. Frictionless spec JSON
... | @@ -69,7 +71,7 @@ In the infer option section, we give the number of rows that the tool will based |
If we choose the Data Catalogue's Excel file as the output, the tool can suggest CDE variables for each column of the incoming dataset. This option is available only when a CDE dictionary is provided. The dictionary is an Excel file that contains information for all the CDE variables that are included or will be included in the MIP (this dictionary will be available in the Data Catalogue in the near future). The tool calculates a similarity measure for each column based on the column name similarity (80%) and the value range similarity (20%). The similarity measure takes values between 0 and 1. With the **similarity threshold** option we can define the minimum similarity between an incoming column and a CDE variable that must be met for the tool to suggest that CDE variable as a possible correspondence. The tool stores the CDE suggestions in the Excel file in the column named **CDE**, and stores the corresponding concept path in the column **conceptPath**.
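One plausible reading of the 80%/20% weighting above is a weighted sum of the two partial similarities. The sketch below works under that assumption; the function names are hypothetical, and the inclusive threshold comparison is likewise an assumption rather than the tool's documented behaviour.

```shell
# Hypothetical helpers illustrating the weighting described above; this is not
# qctool's actual code.

# combined_similarity NAME_SIM RANGE_SIM
# Both inputs are expected in [0, 1]; prints the weighted sum with two decimals.
combined_similarity() {
  awk -v name_sim="$1" -v range_sim="$2" \
    'BEGIN { printf "%.2f\n", 0.8 * name_sim + 0.2 * range_sim }'
}

# meets_threshold SIMILARITY THRESHOLD
# Succeeds (exit status 0) when SIMILARITY >= THRESHOLD (assumed inclusive).
meets_threshold() {
  awk -v sim="$1" -v thr="$2" 'BEGIN { exit !(sim + 0 >= thr + 0) }'
}
```

For example, a column with a name similarity of 0.9 and a value range similarity of 0.5 scores 0.82 under this reading, so `meets_threshold 0.82 0.8` succeeds and the CDE variable would be suggested.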
# DICOM MRI metadata validation

``` shell
Usage: qctool dicom <options> <dicom folder> <report folder>
... | @@ -79,13 +81,15 @@ Usage: qctool dicom <options> <dicom folder> <report folder> |
`<report folder>` is the folder where the report files will be placed. If the folder does not exist, the tool will create it.

## Options

`--loris_folder` folder path where the dcm files are reorganized for the LORIS pipeline

For the LORIS pipeline, the dcm files are reorganized and stored in the folder structure `<loris_folder>/<patientid>/<patientid_visitcount>`.
All the dcm sequence files that belong to the same scanning session (visit) are stored in the common folder `<patientid_visitcount>`.
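The folder layout above can be sketched as a small shell helper that builds the session folder path for one visit. The function name is hypothetical, and joining the patient id and the visit count with an underscore is an assumption based on the placeholder `<patientid_visitcount>`; the tool's exact naming may differ.

```shell
# Hypothetical helper, not part of qctool: build the session folder path
# <loris_folder>/<patientid>/<patientid_visitcount>.
# Assumption: the last component joins patient id and visit count with '_'.
# Arguments: LORIS_FOLDER PATIENT_ID VISIT_COUNT
loris_session_dir() {
  printf '%s/%s/%s_%s\n' "$1" "$2" "$2" "$3"
}
```

With the made-up values `/data/loris`, `P001` and `2`, the helper prints `/data/loris/P001/P001_2`, the folder that would hold every dcm sequence file from that patient's second scanning session.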
## Output files

The tool creates a PDF report file (`dicom_report.pdf`) in the `<report folder>` and, depending on the results, also creates the following CSV files:

- validsequences.csv
... | @@ -96,6 +100,10 @@ The tool creates in the `<report folder>`, a pdf report file (`dicom_report.pdf` |
The above files are created even if no valid or invalid sequences or DICOM files have been found. In that case, the files will be empty.
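Since empty files carry no findings, it can be handy to list them before opening each report. The sketch below is a hypothetical helper, not a qctool feature, and it assumes that "empty" means a zero-byte file; if the tool writes a header row into otherwise empty files, the check would need to count lines instead.

```shell
# Hypothetical helper, not part of qctool: print the names of the CSV files in
# a report folder that contain no data.
# Assumption: "empty" means a zero-byte file.
empty_reports() (
  for f in "$1"/*.csv; do
    # -e guards against the unexpanded glob when no CSV files exist.
    if [ -e "$f" ] && [ ! -s "$f" ]; then
      basename -- "$f"
    fi
  done
)
```

Calling `empty_reports <report folder>` prints one file name per line, or nothing when every CSV file has content.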
### dicom_report.pdf

Please refer to the repo wiki section [Reports - Descriptions and Details](https://github.com/aueb-wim/DataQualityControlTool/wiki/Reports---Descriptions-and-Details) for a detailed explanation of the content of this report.

### validsequences.csv

If there are valid sequences, the tool creates this CSV file. A sequence is 'valid' if it meets the minimum requirements found [here](https://hbpmedical.github.io/deployment/data/). This file contains all the valid MRI sequences found in the given DICOM folder, with the following headers describing each sequence:
|
| ... | | ... | |