NAACCR Tumor Registry v18 Data in i2b2 etc.: Beta Testing

While various issues remain, KUMC put a version of this code in production in the Sep 2019 HERON Great Salt Lake release.

NAACCR File ETL: Getting Started

We assume you have access to your NAACCR v18 file; for testing, you can use naaccr-xml-sample-v180-incidence-100.txt.

For integration testing, we use the H2 Database Engine, but when you are ready, you should configure access to Postgres, SQL Server, or Oracle as in section 3.4.2 Set Database Properties of the i2b2 installation guide.

db.username=SA
db.password=
db.driver=org.h2.Driver
db.url=jdbc:h2:file:/tmp/DB1;create=true

naaccr.flat-file=naaccr_xml_samples/naaccr-xml-sample-v180-incidence-100.txt
naaccr.tumor-table: TUMOR

i2b2.patient-mapping-query: select distinct MRN, patient_num from ...
# i2b2.schema: i2b2demodata
# i2b2.template-fact-table: i2b2demodata.observation_fact

Then, to create the TUMOR table (without the PATID column) following draft PCORnet specifications:

$ java -jar build/libs/naaccr-tumor-data.jar tumor-table
INFO gpc.DBConfig - getting config from db.properties
...
INFO all_naaccr.DAT: layout NAACCR 18 Incidence
...
INFO inserted 100 records into TUMOR

Patient Mapping using custom SQL

The i2b2 patient_mapping table typically includes a crosswalk from MRN to patient_num, but the details seem to be somewhat idiosyncratic. For example, to use NAACCR item 2300 (medical record number) from v15, which is 11 characters starting at column 3606:

update naacr.tumor tr
set tr.patient_num = (
    select pm.patient_num
    from i2b2data.patient_mapping pm
    where pm.patient_ide_source = 'MRN'
    and pm.patient_ide = trim(leading ' ' from substr(observation_blob, 3606, 11))
    )
;
commit;

Given a script such as the above, use java -jar build/libs/naaccr-tumor-data.jar run patient_mapping.sql to update the patient_num column of the tumor table.
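To make the `substr(observation_blob, 3606, 11)` expression concrete, here is a minimal Python sketch of the same slice-and-trim on a fixed-width record. SQL `substr` is 1-based, so column 3606 corresponds to Python index 3605. The function name `extract_mrn` and the sample record are hypothetical and only serve the illustration; they are not part of this repository.

```python
MRN_START = 3606   # 1-based column of NAACCR item 2300 in the v15 layout (from the text)
MRN_LENGTH = 11    # field width in characters (from the text)


def extract_mrn(record: str, start: int = MRN_START, length: int = MRN_LENGTH) -> str:
    """Mimic trim(leading ' ' from substr(record, start, length))."""
    begin = start - 1  # SQL substr is 1-based; Python slices are 0-based
    return record[begin:begin + length].lstrip(" ")


# Hypothetical record: filler up to column 3605, then a right-justified MRN.
sample_record = "X" * 3605 + "     123456" + "Y" * 20
print(extract_mrn(sample_record))
```

Checking a handful of real records this way before running the update can confirm that the column offset matches the layout version of your file.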

TODO: the tumor table should also have a PATID VARCHAR column (issue 48).

When building i2b2 facts, set --mrn-item=medicalRecordNumber correspondingly (using NAACCR XML ids) and specify in db.properties:

i2b2.patient-mapping-query=select distinct tr.MRN, pm.patient_num \
  from i2b2data.patient_mapping pm \
  join (select trim(leading ' ' from substr(observation_blob, 3606, 11)) as MRN from tumor) tr \
  on tr.MRN = pm.patient_ide

naaccr-tumor-data.jar Usage Reference

The naaccr-tumor-data command is short for java -jar naaccr-tumor-data.jar.

Usage:
  naaccr-tumor-data tumor-table [--db=PF] [--task-id=ID]
  naaccr-tumor-data tumor-files [--db=PF] NAACCR_FILE...
  naaccr-tumor-data load-layouts [--db=PF] [--layout-table=T]
  naaccr-tumor-data facts  [--db=F] --upload-id=NNN [--obs-src=S] [--mrn-item=N] [--encounter-start=N]
  naaccr-tumor-data summary  [--db=F] [--task-id=ID]
  naaccr-tumor-data ontology [--db=F] [--table-name=N] [--version=V] [--task-hash=H] [--update-date=D] [--who-cache=D]
  naaccr-tumor-data import [--db=F] TABLE DATA META
  naaccr-tumor-data run SCRIPT [--db=F]
  naaccr-tumor-data query SQL [--db=F]

Options:
  tumor-table        load TUMOR table from flat file
  --task-id=ID       version / completion marker [default: task123]
  tumor-files        load NAACCR records into a (CLOB) column of a DB table
  load-layouts       load NAACCR layout data
  --layout-table=T   where to load layout data [default: LAYOUT]
  facts              build OBSERVATION_FACT_NNN table
  --upload-id=NNN    to fill in observation_fact.upload_id [default: 1]
  --obs-src=S        sourcesystem_cd to give to facts [default: tumor_registry@kumed.com]
  --mrn-item=N       NAACCR item to use for patient mapping [default: patientIdNumber]
  --encounter-start=N  encounter_num start [default: 2000000]
  summary            build NAACCR_EXTRACT_STATS table
  --db=PROPS         database properties file [default: db.properties]
  ontology           build NAACCR_ONTOLOGY table
  --table-name=T     ontology table name [default: NAACCR_ONTOLOGY]
  --version=NNN      ontology version [default: 180]
  --task-hash=H      ontology completion marker
  --update-date=D    ontology update_date in YYYY-MM-DD format
  --who-cache=DIR    where to find WHO oncology metadata
  import             import CSV
  TABLE              target table name
  DATA               CSV file
  META               W3C tabular data metadata (JSON)
  run                run SQL script
  query              run SQL query and write results to stdout in JSON
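If you drive the jar from a script, a small wrapper can assemble the argument list from the options documented above. This is only a sketch: the helper name `facts_command` is hypothetical, the jar path is the one used in the Getting Started example, and the upload id is made up.

```python
import shlex


def facts_command(upload_id, mrn_item="patientIdNumber", db="db.properties",
                  jar="build/libs/naaccr-tumor-data.jar"):
    """Build the argument list for the `facts` subcommand (hypothetical helper)."""
    return [
        "java", "-jar", jar, "facts",
        f"--db={db}",
        f"--upload-id={upload_id}",
        f"--mrn-item={mrn_item}",
    ]


cmd = facts_command(1234, mrn_item="medicalRecordNumber")
print(shlex.join(cmd))  # show the command line without running it
```

Once the jar is built, the list can be passed to `subprocess.run(cmd, check=True)`.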

NAACCR Record Layout Version 18

The NAACCR_ETL process used at KUMC and other GPC sites to load tumor registry data into i2b2 has been made obsolete by version 18 of the NAACCR standard.

ref:

Thornton ML, (ed). Standards for Cancer Registries Volume II: Data Standards and Data Dictionary, Record Layout Version 18, 21st ed. Springfield, Ill.: North American Association of Central Cancer Registries, February 2018, (Revised Mar. 2018, Apr. 2018, May 2018, Jun. 2018, Aug. 2018, Sept. 2018, Oct. 2018).

Previous work: KUMC HERON i2b2 NAACCR ETL

2011: HERON i2b2 clinical data warehouse helps KUMC win CTSA award
2011: HERON TumorRegistry integration helps KU Med Center win NCI designation
2017: Using the NAACCR Cancer Registry in i2b2 with HERON ETL presented by Dan Connolly at i2b2 tranSMART Foundation User Group Meeting

please cite:

Waitman LR, Warren JJ, Manos EL, Connolly DW. Expressing Observations from Electronic Medical Record Flowsheets in an i2b2 based Clinical Data Repository to Support Research and Quality Improvement. AMIA Annu Symp Proc. 2011;2011:1454-63. Epub 2011 Oct 22.
Rogers AR, Lai S, Keighley J, Jungk J. The Incidence of Breast Cancer among Disabled Kansans with Medicare KJM 2015-08

Previous work: GPC Breast Cancer Survey

GPC Breast Cancer Data Quality Reporting - multi-site project with REDCap Data Dictionary
- "On 23 Dec 2014, GPC honest brokers were requested to run a breast cancer cohort query ..."
NAACCR_ETL - GPC NAACCR ETL wiki page

cite:

Chrischilles et al. Upper extremity disability and quality of life after breast cancer treatment in the Greater Plains Collaborative clinical research network. Breast Cancer Res Treat. 2019 Jun;175(3):675-689. doi: 10.1007/s10549-019-05184-1. Epub 2019 Mar 9.
Waitman LR, Aaronson LS, Nadkarni PM, Connolly DW, Campbell JR. The greater plains collaborative: a PCORnet clinical research data network. J Am Med Inform Assoc. 2014;21:637–641. doi: 10.1136/amiajnl-2014-002756.

Building on NAACCR XML from SEER

naaccr-xml - NAACCR XML reader in Java by F. Depry of IMS for SEER
- first release: v0.5 (beta) Apr 20, 2015; v1.0 Feb 7, 2016
- frequent releases:
  - v6.6 Feb 6, 2020
  - ...
  - v5.4 Jun 13, 2019
  - v5.3 May 21, 2019
XML replaces flat file in 2020 (IOU citation)
also supports flat files
imsweb/layout has sections, codes, etc.
NAACCR XML WG meets alternate Fridays 11amET (e.g. Aug 2)

Coded concepts

Metadata for coded values is also a work in progress.

HERON ETL was based on a NAACCR v12 MS Access DB that no longer seems to be maintained / published.
currently using a mix of:
- LOINC answer lists (from v11 and v12)
- well curated code-labels: naaccr NAACCR reader in R by N. Werth of PA Dept. of Health
hope to incorporate codes from imsweb/layout

Primary sites, morphologies from WHO

Maintained by the World Health Organization (WHO)

primary sites: e.g. C50 for Breast
- i2b2 ontology support ported from HERON ETL
morphology: 9800/3 for Leukemia
- morphology i2b2 ontology support TODO
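An ICD-O-3 morphology code such as 9800/3 combines a four-digit histology with a one-digit behavior code. A minimal sketch of splitting the two, assuming the slash-separated form shown above; the function name `split_morphology` is hypothetical, and this is not the ontology support marked TODO.

```python
import re

# Four-digit histology, a slash, then a one-digit behavior code.
MORPHOLOGY = re.compile(r"^(\d{4})/(\d)$")


def split_morphology(code: str):
    """Return (histology, behavior) for a code written like '9800/3'."""
    match = MORPHOLOGY.match(code.strip())
    if match is None:
        raise ValueError(f"not a morphology code: {code!r}")
    return match.group(1), match.group(2)


histology, behavior = split_morphology("9800/3")
print(histology, behavior)
```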

SEER Site Recode

combines primary site and histology
e.g. 20010 for Lip
i2b2 ontology support ported from HERON ETL

site-specific factors

Obsolete in 2018, but to capture data from older cases...

site-specific factors from cancerstaging.org
- added to HERON March 2016; see GPC ticket 150
- These are obsolete in cases abstracted per v18 but still used in older cases.
- WerthPADOH issue: Handle site-specific codes in fields #35 opened Apr 24 2019

Platform for v18: JVM, JDBC, H2 DB, tablesaw, groovy, (and luigi)

We are taking this opportunity to reconsider our platform and approach:

Portable to database engines other than Oracle
Explicit tracking of data flow to facilitate parallelism
Rich test data to facilitate development without access to private data
Separate repository from HERON EMR ETL to avoid information blocking friction

See CONTRIBUTING for details.

JDBC
- lets us leverage the working knowledge of SQL in our community
  - HERON ETL: 30KLOC of SQL
- portable: same JVM platform as i2b2
- JDBC connectivity to datamarts
- H2 for in-memory DB
groovy to fill in gaps where SQL is awkward, such as
- iterating over columns or tables
- tablesaw Dataframe library a la python pandas, Spark
- difference from Java worthwhile? see CONTRIBUTING
luigi (optional)
- luigi tasks preserve partial results

NAACCR Ontology for i2b2: Luigi Usage

The NAACCR_Ontology1 task creates a NAACCR_ONTOLOGY table:

$ luigi --module tumor_reg_tasks NAACCR_Ontology1
DEBUG: Checking if NAACCR_Ontology1(design_id=upper, naaccr_version=18, naaccr_ch10_bytes=3078052) is complete
15:48:09 INFO: ...status   PENDING
15:48:09 INFO: Running Worker with 1 processes
...
15:48:20 ===== Luigi Execution Summary =====
15:48:20
15:48:20 Scheduled 1 tasks of which:
15:48:20 * 1 ran successfully:
15:48:20     - 1 NAACCR_Ontology1(...)

If you're interested in luigi usage, see client.cfg for details. If not, see:

tumor_reg_ont.py
heron_load/tumor_item_value.csv, and
heron_load/naaccr_concepts_load.sql

Related work: OMOP / OHDSI Oncology Working Group

OMOP CDM WG - Oncology Subgroup
- call for use cases May 2018
  - revived after Jun 25 meeting
- using SEER API in vocabulary mapping work
- Standing weekly meetings: Tuesday, 11 am ET. (e.g. 7/9/2019)

Wish list: missing data sentinels

WerthPADOH includes sentinel codes for missing data
- e.g. Grade code 9 = "Grade/differentiation unknown, not stated, or not applicable"
"Absence of data is not represented within OMOP." -- OMOP Oncology WG NAACCR treatment ETL instructions
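One way to keep sentinel codes visible rather than silently dropping them is a lookup that flags them while leaving the recorded code intact. The sketch below is illustrative only: the table holds just the Grade example from above, the item id `grade` is assumed to be the NAACCR XML id for that item, and `is_missing_sentinel` is a hypothetical name.

```python
# Sentinel codes that mean "missing", keyed by item id.
# Only the Grade code 9 example comes from the text; extend per item as needed.
MISSING_SENTINELS = {
    "grade": {"9"},
}


def is_missing_sentinel(item: str, code: str) -> bool:
    """True when the code is blank or a known missing-data sentinel for the item."""
    code = code.strip()
    return code == "" or code in MISSING_SENTINELS.get(item, set())


for item, code in [("grade", "9"), ("grade", "2"), ("sex", "9")]:
    print(item, code, is_missing_sentinel(item, code))
```

Because the function only flags the value, a downstream loader can still store the sentinel as a fact or filter it out, depending on the target model.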

`NAACCR_PATIENTS`, `NAACCR_TUMORS`, and `NAACCR_OBSERVATIONS`: Usage

Using tumor_reg_data.py, the NAACCR_Load task turns a NAACCR v18 flat file into tables for patients, tumors, and observations:

$ luigi --module tumor_reg_tasks NAACCR_Load
...
15:02:04 5890 INFO: Informed scheduler that task   NAACCR_Load_2019_08_20_tumor_registry_k_1306872023_225d26f0cd   has status   DONE

The tables are:

naaccr_patients, naaccr_tumors, naaccr_observations
observation_fact_NNNN, observation_fact_deid_NNNN
- where NNNN is an upload_id

naaccr_patients and observation_fact_NNNN depend on an existing patient_mapping table. naaccr_tumors uses a reserved range of encounter_num.
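The reserved encounter_num range can be pictured as a counter that starts at a fixed offset and hands one number to each tumor. A minimal sketch, assuming the start value matches the `--encounter-start` default of 2000000 from the usage reference; the tumor ids and the function name `assign_encounter_nums` are hypothetical.

```python
from itertools import count


def assign_encounter_nums(tumor_ids, start=2000000):
    """Pair each tumor id with the next encounter_num from the reserved block."""
    return dict(zip(tumor_ids, count(start)))


encounters = assign_encounter_nums(["T-001", "T-002", "T-003"])  # hypothetical ids
for tumor_id, encounter_num in encounters.items():
    print(tumor_id, encounter_num)
```

Choose a start value above any encounter_num already used by other sources so the ranges do not collide.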

TODO: publish generated notebook; smooth out the level of detail here vs. there.

Use case: GPC Breast Cancer survey

GENERATED_SQL for two i2b2 queries from GPC Breast cancer work

See test_data/bcNNN_generated.sql.

Toward Synthetic NAACCR Test Data

In test_data:

capture statistics of NAACCR data
synthesize test data with similar distributions

See test_data/data_char_sim.sql, test_data/tr_summary.py
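The two steps above can be sketched in a few lines of Python: count how often each code occurs, then draw synthetic values with the same relative frequencies. This is an illustration of the idea, not the method used in data_char_sim.sql or tr_summary.py; the observed codes and the function names are hypothetical.

```python
import random
from collections import Counter


def capture_stats(values):
    """Count how often each code occurs in one column."""
    return Counter(values)


def synthesize(stats, n, seed=0):
    """Draw n codes with probabilities proportional to the captured counts."""
    rng = random.Random(seed)          # seeded so test data is reproducible
    codes = sorted(stats)              # stable order for reproducibility
    weights = [stats[code] for code in codes]
    return rng.choices(codes, weights=weights, k=n)


observed = ["C509", "C509", "C619", "C349"]  # hypothetical primary site codes
stats = capture_stats(observed)
synthetic = synthesize(stats, 10)
print(synthetic)
```

Only the counts leave the private data, which is what lets the synthetic rows be shared for development.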

Goal: data characterization, checks, charts

jupyter notebook of checks, charts

PCORnet CDM Empirical Data Characterization report
GPC Breast Cancer QA reports

Toward PCORnet CDM Integration

exploring FIELDS, VALUESETS a la PCORnet CDM for a TUMOR table
Another approach: i2b2 OBSERVATION_FACT -> PCORnet OBS_GEN.
- crosswalk to LOINC (as of NAACCR v12): loinc-naaccr/loinc_naaccr.csv
  - on loinc-csvdb branch

See pcornet_cdm/ directory.

Goal: "data lake" ETL, multiple i2b2 fact tables

HERON ETL 2011: copy NAACCR flat file into DB with Oracle sqlldr
Spark approach:
1. specify transformation from the NAACCR flat file to an i2b2 fact table
  - use PySpark to un-pivot / melt
2. facts.write.jdbc(...oracle db...)
  - Spark schedules the work of transforming the flat file.
aim to use multiple fact tables in i2b2 1.7.09.
- crc.properties set queryprocessor.multifacttable=true
- in ontology table, set c_facttablecolumn=NAACCR_FACT.concept_cd
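The un-pivot / melt step turns one wide record per tumor into one narrow row per item and value, which is the shape an i2b2 fact table expects. The sketch below shows the reshaping in plain Python rather than PySpark, so it runs without a cluster; the column names, sample records, and the function name `melt` are hypothetical.

```python
def melt(records, id_key="tumor_id"):
    """Un-pivot wide records into (id, item, value) rows, skipping blank values."""
    facts = []
    for record in records:
        for item, value in record.items():
            if item == id_key or value is None or value == "":
                continue
            facts.append((record[id_key], item, value))
    return facts


wide = [  # hypothetical wide records, one per tumor
    {"tumor_id": 1, "primarySite": "C509", "sex": "2"},
    {"tumor_id": 2, "primarySite": "C619", "sex": ""},
]
for row in melt(wide):
    print(row)
```

Blank fields are skipped here so that empty fixed-width columns do not turn into facts; whether that is the right choice depends on how missing data should be represented.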

Related work: FHIR / HL7

HL7 FHIR Implementation Guide: Breast Cancer Data, Release 1 - US Realm (Draft for Comment 2)
mCODE - Standard Health Record Collaborative HL7 FHIR Implementation Guide: minimal Common Oncology Data Elements (mCODE), v0.9.0, Version: 0.9.0 ; FHIR © Version: 1.0.2 ; Generated on Wed, Apr 17, 2019 12:01-0400.
