Skip to content

Latest commit

 

History

History
697 lines (530 loc) · 20 KB

ConfigurationGuide.md

File metadata and controls

697 lines (530 loc) · 20 KB

REDCap-ETL Configuration

This guide discusses the details of REDCap-ETL configuration properties.

Data Source

The data source is the REDCap project that has the data to be extracted. It is specified as the REDCap API token for the project and the REDCap API URL for the REDCap instance that has the project, for example:

redcap_api_url = http://redcap.someplace.edu/api/
data_source_api_token = 1A2FC4E892CC712031BA94D1198127C1

Transformation Rules

The transformation rules specify how the records in REDCap should be transformed into records in your database.

You need to set the transform_rules_source property to one of the numbers (1, 2, or 3), which are described below:

  1. text - rules are entered directly in the configuration file. This option is only available for JSON (.json) configuration files, and not .ini configuration files. JSON configuration files are
  2. file - the rules are stored in a file:
    • on the REDCap-ETL server, for file-based configuration
    • uploaded to the REDCap configuration project, for project-based configuration
  3. auto-generated - default rules (one table per form) are automatically generated by REDCap-ETL. This option is mainly intended for testing that an installation or project has been set up correctly.

For option 2 (the file option), you need to specify the file containing your transformation rules using the transform_rules_file property. So, for example, you might have the following in your configuration file:

transform_rules_source = 2
transform_rules_file = basic-demography-rules.txt

Using option 3 (auto-generated) above will not allow you to modify the auto-generated rules. To generate rules so that they can be modified, use the following script to generate a file of rules, and then edit those rules as appropriate:

bin/transformation_rules_generator.php

The script uses your REDCap-ETL configuration file to generate rules. For example, from the top-level directory of the installation, you could use the following command to generate rules for a config/test.ini configuration file:

php bin/transformation_rules_generator.php config/test.ini > config/test-rules.txt

The above script supports several options:

  1. -c, --complete-fields - includes form complete fields in the auto-generated transformation rules.
  2. -d, --dag-fields - includes DAG (Data Access Group) fields in the auto-generated transformation rules.
  3. -f, --file-fields - includes file fields in the auto-generated transformation rules. Note that if a file field is included in the rules, only a flag indicating that a document has been uploaded for this field will be exported from REDCap. The file contents will NOT be exported.
  4. -s, --survey-fields - includes survey fields (survey identifier and timestamp) for forms that are enabled as surveys.
  5. -n, --notes-fields - remove notes fields.
  6. -i, --identifier-fields - remove identifier fields.
  7. -t, --table-nonrepeating <table-name> - combine non-repeating fields into table "table-name" (for non-longitudinal projects).

Examples:

php bin/transformation_rules_generator.php -c -d -f config/test.ini > config/test-rules.txt
php bin/transformation_rules_generator.php -cdf config/test.ini > config/test-rules.txt
php bin/transformation_rules_generator.php --dag-fields config/test.ini > config/test-rules.txt
php bin/transformation_rules_generator.php --dag-fields --file-fields config/test.ini > config/test-rules.txt

After creating the rules file with the script, you will need to modify your configuration file to use the rules file, for example:

transform_rules_source = 2
transform_rules_file = test-rules.txt

Details about the transformation rules can be found here: Transformation Rules Guide

Database

In your configuration file, you need to set the property db_connection to a string that provides connection information for your database, for example:

db_connection = MySQL:127.10.10.1:etl_user:etl_password:etl_test_db

The database where the data is loaded is specified as a database connection string. This string has the following format:

    <database-connection-type> : <database-connection-values>

Currently, the supported database connection types are

  • MySQL
  • CSV (comma-separated values)
  • PostgreSQL
  • SQLite
  • SQL Server

MySQL

For MySQL, the format of the database connection string is:

    MySQL:<host>:<username>:<password>:<database>[:<port>]

Example MySQL database connection strings:

    MySQL:localhost:etl_user:etl_password:etl_test_db

    MySQL:someplace.edu:admin:admin_password_123:etl_prod_db:3306

Note: Since the ':' character is used as a separator for the database connection string, if any of the values in your database connection contain a ':', it needs to be escaped with a backslash. For example, if your password is "my:password", then it would need to be specified as "my\:password".

CSV

For CSV, the database connection string format is:

    CSV:<output-directory>

For example:

    CSV:/home/redcap-etl/csv/project1

PostgreSQL

For PostgreSQL, the database connection string format is:

PostgreSQL:<host>:<username>:<password>:<database>[:<schema>[:<port>]]

SQLite

For SQLite, the database connection string format is:

    SQLite:<database-file-path>

For example:

    SQLite:/home/redcap-etl/db/etl-data.db

SQL Server

For SQL Server, the format of the database connection string is:

    SQLServer:<host>:<username>:<password>:<database>[:<port>]

Example SQL server database connection strings:

    SQLServer:localhost:some_user:amazingPW987:etl_test_db

    SQLServer:someplace.edu:admin:adminPassword123:etl_prod_db:1433

Note: Since the ':' character is used as a separator for the database connection string, if any of the values in your database connection contain a ':', it needs to be escaped with a backslash. For example, if your password is "my:password", then it would need to be specified as "my\:password".


REDCap-ETL Configuration Properties

This section provides a detailed list of the configuration properties.

REDCap Connection Properties

Property Description
redcap_api_url The URL for your REDCap API. Typically this will be your REDCap URL with "/api/" appended to the end. Not ending the URL with a slash (/) may cause an error.
ssl_verify Indicates if SSL verification is used for the connection to REDCap. This defaults to true. Setting it to false is insecure.
ca_cert_file Certificate authority certificate file. This can be used to support SSL verification of the connection to REDCap if your system does not provide support for it by default

REDCap Project Properties

Property Description
data_source_api_token The API token for the REDCap project from which the data is being extracted from REDCap.
extract_filter_logic REDCap filter logic that is used to restrict the records extracted from the project. For example,
extract_filter_logic='[record_id] < 1020'
would cause only records with a record ID less than 1020 to be extracted from REDCap. Note that REDCap-ETL just passes the extract filter logic value to REDCap, and REDCap does minimal error checking. In many cases when you have an error in your filter, REDCap will not generate an error and no records will be returned. You can test out your filter logic in the API Playground within REDCap to make sure that it returns the records expected. Also note that some characters typically used in for the filter logic can cause an error when REDCap-ETL processes the configuration file. To prevent this from happening, it is best to put the filter logic value in quotes as shown in the example above.

Database Properties

Properties for the database where the extracted data is loaded.

Property Description
db_connection The database connection string for the database where the data is loaded.
db_ssl Flag that indicates if SSL should be used for database connection (true by default). This property is not applicable to CSV files or the SQLite database. Note: on Linux systems when using MySQL, having this set to true may cause the database connection to fail when the database host is specified as "localhost". To fix this problem, set db_ssl to 'false' (or 0), or specify "127.0.0.1" for the database host instead of "localhost".
db_ssl_verify Flag that indicates if the SSL certificate of the database server should be verified. For this to work, a valid ca_cert_file (certificate authority certificate file) needs to be specified. This property is not applicable to CSV files or SQLite databases.
db_primary_keys Flag that indicates if primary key constraints should be created for database tables (true by default). This functionality is not available for CSV files or SQLite databases.
db_foreign_keys Flag that indicates if foreign key constraints should be created for database tables that are specified as having parent tables in the transformation rules (true by default). This functionality is not available for CSV files or SQLite databases.

Database Logging Properties

REDCap-ETL logs to the load database by default. It creates and logs to 2 tables:

  1. etl_log - contains one row for each ETL process run, where each row contains the start time, table prefix used (if any), batch size, and REDCap-ETL version number.
  2. etl_event_log - contains the individual event messages for each ETL process run.

The 2 tables can be joined on their log_id attributes.

Unlike the other tables REDCap-ETL generates, the database logging tables are not deleted between runs, so these tables accumulate the results of all ETL runs.

Property Description
db_logging A true/false property indicating if REDCap-ETL should log to the database. The default value for this property is true. Database logging is not supported for CSV (comma-separated value) file output.
db_log_table The name of the main database logging table. This name defaults to **etl_log**.
db_event_log_table The name of the database logging event table. This name defaults to **etl_event_log**.

E-mail Properties

Properties for e-mail error notifications and processing summaries that can be sent by REDCap-ETL.

Property Description
email_errors True/false value that indicates if an e-mail should be sent when errors occur to the "email_to_list". The default value is true.
email_summary True/false value that indicates if an e-mail summary of the log messages should be sent to the "email_to_list". The default value is false, and no summary will be sent if the ETL process encounters an error.
email_from_address The from address for e-mail notifications sent by REDCap-ETL
email_subject The subject for e-mail notifications sent by REDCap-ETL
email_to_list The to address list for e-mail notifications sent by REDCap-ETL

Transformation Rules Auto-Generation Properties

For configurations where the transform_rules_source property is set to 3 (auto-generation), the following properties can be used to control the auto-generation of transformation rules:

Property Description
autogen_include_complete_fields Set to 1 or 'true' to include form complete fields in the auto-generated transformation rules. By default, they will not be included.
autogen_include_dag_fields Set to 1 or 'true' to include DAG (Data Access Group) fields in the auto-generated transformation rules. By default, they will not be included.
autogen_include_file_fields Set to 1 or 'true' to include file fields in the auto-generated transformation rules. By default, they will not be included.
autogen_include_survey_fields Set to 1 or 'true' to include survey fields in the auto-generated transformation rules. By default, they will not be included.
autogen_remove_notes_fields Set to 1 or 'true' to remove REDCap notes fields in the auto-generated transformation rules. By default, they will be included.
autogen_remove_identifier_fields Set to 1 or 'true' to remove REDCap fields that are marked as identifiers in the auto-generated transformation rules. By default, they will be included.
autogen_combine_non_repeating_fields Set to 1 or 'true' to have all fields in non-repeating forms combined into a single table for non-longitudinal projects. By default, each form will be in a separate table.
autogen_non_repeating_fields_table If the autogen_combine_non_repeating_fields property is set, this property needs to be set as the table name to use for the table that combines the non-repeating fields.

SQL Processing Properties

REDCap-ETL supports properties that can be used to specify custom SQL processing. You can specify SQL to be executed before and/or after the ETL process runs.

Note:

  • If a table prefix is specified for the ETL process, that prefix will NOT be automatically added to table names in the post-processing SQL file, so you will need to add it manually, and change it if the table prefix changes.
  • This feature is intended for running commands that update the database, and select commands will generate no output.
  • Custom SQL processing is not supported for CSV files.
Property Description
pre_processing_sql_file File with SQL statements to execute on the database before the ETL process runs. This can be useful for dropping indexes and views created using post-processing SQL.
post_processing_sql_file File with SQL statements to execute on the database after the ETL process has finished. This can be used, for example, to create indexes or custom views on tables generated by the ETL process.

Multiple Choice Label Properties

In REDCap, multiple choice questions have values and corresponding labels, e.g., [0 => "no", 1 => "yes"]. Originally, REDCap-ETL only displayed the values in the tables generated in the target database, and then created views that contained only the labels. Now REDCap-ETL generates both the values and labels in the tables. The current plan is to remove the label views in a future version of REDCap-ETL.

Property Description
label_view_suffix *(Deprecated)* A view is created that contains the labels for multiple choice questions (and not the codes/values)
label_field_suffix Suffix that will be added to multiple-choice label fields. The default value is "_label" To turn off these fields from being generated, set this to blank.

Generated Fields Type Properties

REDCap-ETL generates several fields automatically. The generated field type properties can be used to modify the types of these fields.

Property Description
generated_label_type The type used for the label fields generated by REDCap-ETL for checkbox, radio and dropdown fields. If varchar or char types are specified, the size for the labels will be calculated to be the minimum needed for the largest value for the specific field for the label.
generated_name_type The type used for name fields for events and instruments generated by REDCap-ETL (the redcap_event_name and redcap_repeat_instrument fields)
generated_record_id_type The type used for the REDCap record ID fields generated for each table by REDCap-ETL
generated_suffix_type The type used for the suffix fields generated by REDCap-ETL when a suffixes rows type is specified

Lookup Table Properties

REDCap-ETL uses a "Lookup" table internally to map REDCap's multiple choice values to their corresponding labels. This internal table can be saved to your database.

Property Description
create_lookup_table A true/false property that indicates whether or not a lookup table should be created in the database. The property defaults to "false" if not specified.
lookup_table_name The name to use for the lookup table created in the database. The default table name is "etl_lookup".

Other Properties

Property Description
calc_field_ignore_pattern A pattern represented as a PHP regular expression that will be used in determining what calculation fields values are considered when evaluating REDCap data rows for inclusion in a table. By default, if a calculation field is included in a table, then any REDCap data records which have a non-blank value for that calculation field will be included in the table. If, however, you specified the pattern '/^0$/' for this property, then REDCap-ETL would ignore calculation fields that are blank or zero when determining what REDCap data records should be included in the table.
ignore_empty_incomplete_forms In certain situations, REDCap will set the form complete field for a form to the "incomplete" value for a record, even though the form has never been edited or saved for that record. When this happens, by default, REDCap-ETL will store a row for that form for the record in the database. The row will have no values except for the form complete field, which will be set to zero. However, if this property is set to 1 or "true", then REDCap-ETL will ignore records for such forms.
extracted_record_count_check A true/false property that indicates whether or not a check should be done to make sure that the number of records extracted from REDCap is the number expected. This is done to catch cases where REDCap's API does not export all the records that it should due to an internal error, but does not notify REDCap-ETL of the error. By default this check is turned on, but you might need to turn it off if records could be deleted or added when the ETL process runs, which could also cause the counts not to match.
log_file File to use for logging
time_limit Maximum number of seconds the ETL process is allowed to run. A value of zero (the default value) means there is no limit.
time_zone Timezone to use; see the URL below for valid values: https://secure.php.net/manual/en/timezones.php By default, the default PHP timezone on the system is used.