While various issues remain, KUMC put a version of this code in production in the Sep 2019 HERON Great Salt Lake release.
We assume you have access to your NAACCR v18 file; for testing, you can use naaccr-xml-sample-v180-incidence-100.txt.
For integration testing, we use the H2 Database Engine, but when you are ready, you should configure access to Postgres, SQL Server, or Oracle as in section 3.4.2 Set Database Properties of the i2b2 installation guide.
db.username=SA
db.password=
db.driver=org.h2.Driver
db.url=jdbc:h2:file:/tmp/DB1;create=true
naaccr.flat-file=naaccr_xml_samples/naaccr-xml-sample-v180-incidence-100.txt
naaccr.tumor-table: TUMOR
i2b2.patient-mapping-query: select distinct MRN, patient_num from ...
# i2b2.schema: i2b2demodata
# i2b2.template-fact-table: i2b2demodata.observation_fact
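The settings above mix key=value and key: value lines with # comments. Assuming db.properties is read as a Java-style properties file (the text does not say how it is parsed), a minimal sketch of how such lines break down into keys and values might look like this. The function name parse_properties is made up for illustration, and only a small subset of the format is handled.

```python
def parse_properties(text):
    """Parse a small subset of the Java .properties format (illustrative only)."""
    props = {}
    pending = ""  # holds a logical line that ended with a backslash
    for raw in text.splitlines():
        line = pending + raw.strip()
        pending = ""
        if not line or line[0] in "#!":
            continue  # blank line or comment
        if line.endswith("\\"):
            pending = line[:-1]  # continuation: join with the next line
            continue
        # The key ends at the first '=' or ':'; the rest is the value.
        for i, ch in enumerate(line):
            if ch in "=:":
                props[line[:i].strip()] = line[i + 1:].strip()
                break
        else:
            props[line] = ""  # a key with no separator has an empty value
    return props


SAMPLE = """\
db.username=SA
db.password=
db.url=jdbc:h2:file:/tmp/DB1;create=true
naaccr.tumor-table: TUMOR
# i2b2.schema: i2b2demodata
"""

for key, value in parse_properties(SAMPLE).items():
    print(key, "->", repr(value))
```

Note that only the first separator splits the line, so the colons and the second = inside the JDBC URL stay in the value.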
Then, to create the TUMOR table (less PATID column) following draft PCORnet specifications:
$ java -jar build/libs/naaccr-tumor-data.jar tumor-table
INFO gpc.DBConfig - getting config from db.properties
...
INFO all_naaccr.DAT: layout NAACCR 18 Incidence
...
INFO inserted 100 records into TUMOR
The i2b2 patient_mapping table typically includes a crosswalk from MRN to patient_num, but the details seem to be somewhat idiosyncratic.
For example, to use NAACCR item 2300 (medical record number) from v15,
which is 11 characters starting at column 3606:
update naaccr.tumor tr
set tr.patient_num = (
select pm.patient_num
from i2b2data.patient_mapping pm
where pm.patient_ide_source = 'MRN'
and pm.patient_ide = trim(leading ' ' from substr(observation_blob, 3606, 11))
)
;
commit;
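To make the column arithmetic in the script above concrete, here is a small Python sketch of the same field extraction. It assumes one flat-file record per string and reuses the 1-based start column 3606 and length 11 quoted above; the sample record and the function name extract_mrn are made up for illustration.

```python
MRN_START = 3606   # 1-based column, as in the SQL substr() call
MRN_LENGTH = 11


def extract_mrn(record: str) -> str:
    """Return the MRN field of one flat-file record with leading blanks removed."""
    begin = MRN_START - 1  # convert the 1-based column to a 0-based index
    field = record[begin:begin + MRN_LENGTH]
    return field.lstrip(" ")  # same effect as trim(leading ' ' from ...)


# A made-up record: blank everywhere except the 11-character MRN field.
record = " " * 3605 + "   00012345" + " " * 100
print(extract_mrn(record))  # prints 00012345
```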
Given a script such as the above, use
java -jar build/libs/naaccr-tumor-data.jar run patient_mapping.sql
to update the patient_num column of the tumor table.
TODO: the tumor table should also have a PATID VARCHAR column (issue 48).
When building i2b2 facts, set --mrn-item=medicalRecordNumber correspondingly (using NAACCR XML ids) and specify in db.properties:
i2b2.patient-mapping-query=select distinct tr.MRN, pm.patient_num \
  from i2b2data.patient_mapping pm \
  join (select trim(leading ' ' from substr(observation_blob, 3606, 11)) as MRN from tumor) tr \
  on tr.MRN = pm.patient_ide
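The shape of the result this query should produce can be tried out without a real warehouse. The sketch below runs an equivalent query against an in-memory SQLite database using Python's sqlite3 module. This is a stand-in, not the project's setup: SQLite has no trim(leading ...) syntax, so ltrim() is used instead, the i2b2data schema prefix is dropped, and all rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table tumor (observation_blob text);
create table patient_mapping (
    patient_ide text, patient_ide_source text, patient_num integer);
""")


def make_record(mrn):
    """Build a made-up flat-file record with the MRN right-justified at column 3606."""
    return " " * 3605 + mrn.rjust(11)


conn.executemany("insert into tumor values (?)",
                 [(make_record("12345"),), (make_record("67890"),)])
conn.executemany("insert into patient_mapping values (?, ?, ?)",
                 [("12345", "MRN", 1), ("99999", "MRN", 2)])

# Same join as the property above, in SQLite dialect.
rows = conn.execute("""
    select distinct tr.MRN, pm.patient_num
    from patient_mapping pm
    join (select ltrim(substr(observation_blob, 3606, 11)) as MRN from tumor) tr
      on tr.MRN = pm.patient_ide
""").fetchall()
print(rows)  # only MRN 12345 has a mapping
```

Tumor records whose MRN has no row in patient_mapping simply drop out of the result, which is worth checking for before building facts.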
The naaccr-tumor-data command is short for java -jar naaccr-tumor-data.jar.
Usage:
naaccr-tumor-data tumor-table [--db=PF] [--task-id=ID]
naaccr-tumor-data tumor-files [--db=PF] NAACCR_FILE...
naaccr-tumor-data load-layouts [--db=PF] [--layout-table=T]
naaccr-tumor-data facts [--db=F] --upload-id=NNN [--obs-src=S] [--mrn-item=N] [--encounter-start=N]
naaccr-tumor-data summary [--db=F] [--task-id=ID]
naaccr-tumor-data ontology [--db=F] [--table-name=N] [--version=V] [--task-hash=H] [--update-date=D] [--who-cache=D]
naaccr-tumor-data import [--db=F] TABLE DATA META
naaccr-tumor-data run SCRIPT [--db=F]
naaccr-tumor-data query SQL [--db=F]
Options:
tumor-table load TUMOR table from flat file
--task-id=ID version / completion marker [default: task123]
tumor-files load NAACCR records into a (CLOB) column of a DB table
load-layouts load NAACCR layout data
--layout-table=T where to load layout data [default: LAYOUT]
facts build OBSERVATION_FACT_NNN table
--upload-id=NNN to fill in observation_fact.upload_id [default: 1]
--obs-src=S sourcesystem_cd to give to facts [default: tumor_registry@kumed.com]
--mrn-item=N NAACCR item to use for patient mapping [default: patientIdNumber]
--encounter-start=N encounter_num start [default: 2000000]
summary build NAACCR_EXTRACT_STATS table
--db=PROPS database properties file [default: db.properties]
ontology build NAACCR_ONTOLOGY table
--table-name=T ontology table name [default: NAACCR_ONTOLOGY]
--version=NNN ontology version [default: 180]
--task-hash=H ontology completion marker
--update-date=D ontology update_date in YYYY-MM-DD format
--who-cache=DIR where to find WHO oncology metadata
import import CSV
TABLE target table name
DATA CSV file
META W3C tabular data metadata (JSON)
run run SQL script
query run SQL query and write results to stdout in JSON
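For scripted runs, the command line can be assembled programmatically. The following sketch builds the argument list for the facts subcommand using only options listed in the usage text above; the helper name facts_command is hypothetical, and the jar path is the one used in the earlier examples, so adjust it to your build.

```python
def facts_command(upload_id, db="db.properties", mrn_item="patientIdNumber",
                  encounter_start=2000000,
                  jar="build/libs/naaccr-tumor-data.jar"):
    """Return an argv list for the facts subcommand (illustrative helper)."""
    return [
        "java", "-jar", jar, "facts",
        f"--db={db}",
        f"--upload-id={upload_id}",
        f"--mrn-item={mrn_item}",
        f"--encounter-start={encounter_start}",
    ]


cmd = facts_command(42, mrn_item="medicalRecordNumber")
print(" ".join(cmd))
# To actually run it (requires Java and the built jar):
#   import subprocess
#   subprocess.run(cmd, check=True)
```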
The NAACCR_ETL process used at KUMC and other GPC sites to load tumor registry data into i2b2 is outdated by version 18 of the NAACCR standard.
ref:
- Thornton ML, (ed). Standards for Cancer Registries Volume II: Data Standards and Data Dictionary, Record Layout Version 18, 21st ed. Springfield, Ill.: North American Association of Central Cancer Registries, February 2018, (Revised Mar. 2018, Apr. 2018, May 2018, Jun. 2018, Aug. 2018, Sept. 2018, Oct. 2018).
- 2011: HERON i2b2 clinical data warehouse helps KUMC win CTSA award
- 2011: HERON TumorRegistry integration helps KU Med Center win NCI designation
- 2017: Using the NAACCR Cancer Registry in i2b2 with HERON ETL presented by Dan Connolly at i2b2 tranSMART Foundation User Group Meeting
please cite:
- Waitman LR, Warren JJ, Manos EL, Connolly DW. Expressing Observations from Electronic Medical Record Flowsheets in an i2b2 based Clinical Data Repository to Support Research and Quality Improvement. AMIA Annu Symp Proc. 2011;2011:1454-63. Epub 2011 Oct 22.
- Rogers AR, Lai S, Keighley J, Jungk J. The Incidence of Breast Cancer among Disabled Kansans with Medicare KJM 2015-08
- GPC Breast Cancer Data Quality Reporting - multi-site project with REDCap Data Dictionary
- "On 23 Dec 2014, GPC honest brokers were requested to run a breast cancer cohort query ..."
- NAACCR_ETL - GPC NAACCR ETL wiki page
cite:
- Chrischilles et al. Upper extremity disability and quality of life after breast cancer treatment in the Greater Plains Collaborative clinical research network. Breast Cancer Res Treat. 2019 Jun;175(3):675-689. doi: 10.1007/s10549-019-05184-1. Epub 2019 Mar 9.
- Waitman LR, Aaronson LS, Nadkarni PM, Connolly DW, Campbell JR. The greater plains collaborative: a PCORnet clinical research data network. J Am Med Inform Assoc. 2014;21:637–641. doi: 10.1136/amiajnl-2014-002756.
- naaccr-xml - NAACCR XML reader in Java by F. Depry of IMS for SEER
- first release: v0.5 (beta) Apr 20, 2015; v1.0 Feb 7, 2016
- frequent releases:
- v6.6 Feb 6, 2020
- ...
- v5.4 Jun 13, 2019
- v5.3 May 21, 2019
- XML replaces flat file in 2020 (IOU citation)
- also supports flat files
- imsweb/layout has sections, codes, etc.
- NAACCR XML WG meets alternate Fridays 11amET (e.g. Aug 2)
Metadata for coded values is also work in progress.
- HERON ETL was based on a NAACCR v12 MS Access DB that no longer seems to be maintained / published.
- currently using a mix of:
- LOINC answer lists (from v11 and v12)
- well curated code-labels: naaccr NAACCR reader in R by N. Werth of PA Dept. of Health
- hope to incorporate codes from imsweb/layout
Maintained by the World Health Organization (WHO)
- primary sites: e.g. C50 for Breast
- i2b2 ontology support ported from HERON ETL
- morphology: 9800/3 for Leukemia
- morphology i2b2 ontology support TODO
- combines primary site and histology
- e.g. 20010 for Lip
- i2b2 ontology support ported from HERON ETL
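The morphology example above (9800/3) follows the ICD-O-3 pattern of a four-digit histology code, a slash, and a one-digit behavior code. A minimal sketch of splitting such a code, and of grouping a primary site code under its three-character category such as C50, is shown below. The function names are made up, and the assumption that site codes arrive as strings like C509 or C50.9 should be checked against your data.

```python
import re

# Four-digit histology, a slash, one-digit behavior (e.g. 9800/3).
MORPHOLOGY = re.compile(r"^(\d{4})/(\d)$")


def split_morphology(code):
    """Split an ICD-O-3 morphology code into (histology, behavior)."""
    match = MORPHOLOGY.match(code)
    if not match:
        raise ValueError(f"not a morphology code: {code!r}")
    return match.group(1), match.group(2)


def site_category(code):
    """Return the three-character topography category, e.g. C50 for C509."""
    if not re.match(r"^C\d{2}", code):
        raise ValueError(f"not a primary site code: {code!r}")
    return code[:3]


print(split_morphology("9800/3"))
print(site_category("C509"))
```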
Obsolete in 2018, but to capture data from older cases...
- site-specific factors from cancerstaging.org
- added to HERON March 2016; see GPC ticket 150
- These are obsolete in cases abstracted per v18 but still used in older cases.
- WerthPADOH issue: Handle site-specific codes in fields #35 opened Apr 24 2019
We are taking this opportunity to reconsider our platform and approach:
- Portable to database engines other than Oracle
- Explicit tracking of data flow to facilitate parallelism
- Rich test data to facilitate development without access to private data
- Separate repository from HERON EMR ETL to avoid information blocking friction
See CONTRIBUTING for details.
- JDBC
- lets us leverage the working knowledge of SQL in our community
- HERON ETL: 30KLOC of SQL
- portable: same JVM platform as i2b2
- JDBC connectivity to datamarts
- H2 for in-memory DB
- groovy to fill in gaps where SQL is awkward, such as
- iterating over columns or tables
- tablesaw Dataframe library a la python pandas, Spark
- difference from Java worthwhile? see CONTRIBUTING
- luigi (optional)
- luigi tasks preserve partial results
The NAACCR_Ontology1 task creates a NAACCR_ONTOLOGY table:
$ luigi --module tumor_reg_tasks NAACCR_Ontology1
DEBUG: Checking if NAACCR_Ontology1(design_id=upper, naaccr_version=18, naaccr_ch10_bytes=3078052) is complete
15:48:09 INFO: ...status PENDING
15:48:09 INFO: Running Worker with 1 processes
...
15:48:20 ===== Luigi Execution Summary =====
15:48:20
15:48:20 Scheduled 1 tasks of which:
15:48:20 * 1 ran successfully:
15:48:20 - 1 NAACCR_Ontology1(...)
If you're interested in luigi usage, see client.cfg for details. If not, see tumor_reg_ont.py, heron_load/tumor_item_value.csv, and heron_load/naaccr_concepts_load.sql.
- OMOP CDM WG - Oncology Subgroup
- call for use cases May 2018
- revived after Jun 25 meeting
- using SEER API in vocabulary mapping work
- Standing weekly meetings: Tuesday, 11 am ET. (e.g. 7/9/2019)
- WerthPADOH includes sentinel codes for missing data
- e.g. Grade code 9 = "Grade/differentiation unknown, not stated, or not applicable"
- "Absence of data is not represented within OMOP." -- OMOP Oncology WG NAACCR treatment ETL instructions
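The contrast noted above, where NAACCR code lists carry sentinel values for missing data but OMOP does not represent absence, suggests a filtering step when mapping. A rough sketch under stated assumptions: the SENTINEL_CODES table holds only the single Grade example quoted above, the item id grade is a placeholder, and a real mapping would need the sentinel codes for every item.

```python
# Sentinel codes per item; only the Grade example from the text is filled in.
SENTINEL_CODES = {
    "grade": {"9"},  # "Grade/differentiation unknown, not stated, or not applicable"
}


def drop_missing(observations):
    """Keep (item, code) pairs whose code is not a sentinel for that item."""
    return [
        (item, code)
        for item, code in observations
        if code not in SENTINEL_CODES.get(item, set())
    ]


observations = [("grade", "2"), ("grade", "9"), ("primarySite", "C509")]
print(drop_missing(observations))
```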
Using tumor_reg_data.py, the NAACCR_Load task turns a NAACCR v18 flat file into tables for patients, tumors, and observations:
$ luigi --module tumor_reg_tasks NAACCR_Load
...
15:02:04 5890 INFO: Informed scheduler that task NAACCR_Load_2019_08_20_tumor_registry_k_1306872023_225d26f0cd has status DONE
The tables are:
- naaccr_patients, naaccr_tumors, naaccr_observations
- observation_fact_NNNN, observation_fact_deid_NNNN
- where NNNN is an upload_id
naaccr_patients and observation_fact_NNNN depend on an existing patient_mapping table. naaccr_tumors uses a reserved range of encounter_num.
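As an illustration of a reserved encounter_num range, the sketch below numbers tumors consecutively from a fixed start. The start value 2000000 is the documented default of --encounter-start; the function name and the tumor identifiers are invented, and the actual task may assign numbers differently.

```python
from itertools import count


def assign_encounter_nums(tumor_ids, start=2000000):
    """Map each tumor id to the next number in a range beginning at start."""
    return dict(zip(tumor_ids, count(start)))


print(assign_encounter_nums(["tumor-a", "tumor-b", "tumor-c"]))
```

Keeping the range well above the encounter numbers used by other sources avoids collisions when the facts are merged.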
TODO: publish generated notebook; smooth out the level of detail here vs. there.
- GENERATED_SQL for two i2b2 queries from GPC Breast cancer work
See test_data/bcNNN_generated.sql.
In test_data:
- capture statistics of NAACCR data
- synthesize test data with similar distributions
See test_data/data_char_sim.sql, test_data/tr_summary.py
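The two steps listed above, capturing statistics and then synthesizing data with similar distributions, can be sketched for a single coded item as follows. This is a simplified illustration, not the method in data_char_sim.sql: it treats each item independently, and the function names and sample codes are made up.

```python
import random
from collections import Counter


def capture_stats(values):
    """Count how often each code occurs in the real data."""
    return Counter(values)


def synthesize(stats, n, seed=0):
    """Draw n codes with probability proportional to their observed counts."""
    rng = random.Random(seed)  # seeded so test data is reproducible
    codes = sorted(stats)
    weights = [stats[code] for code in codes]
    return rng.choices(codes, weights=weights, k=n)


real_codes = ["1", "1", "1", "2", "2", "9"]  # stand-in for private data
stats = capture_stats(real_codes)
fake_codes = synthesize(stats, 1000)
print(capture_stats(fake_codes))
```

Only the frequency table has to leave the private environment; the synthetic records are generated from it.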
jupyter notebook of checks, charts
- PCORnet CDM Empirical Data Characterization report
- GPC Breast Cancer QA reports
- exploring FIELDS, VALUESETS a la PCORNET CDM for a TUMOR table
- Another approach: i2b2 OBSERVATION_FACT -> PCORNet OBS_GEN.
- crosswalk to LOINC (as of NAACCR v12): loinc-naaccr/loinc_naaccr.csv
- on loinc-csvdb branch
See pcornet_cdm/ directory.
- HERON ETL 2011: copy NAACCR flat file into DB with Oracle sqlldr
- Spark approach:
- specify transformation from the NAACCR flat file to an i2b2 fact table
- use PySpark to un-pivot / melt
- facts.write.jdbc(...oracle db...)
- Spark schedules the work of transforming the flat file.
- aim to use multiple fact tables in i2b2 1.7.09.
- in crc.properties, set queryprocessor.multifacttable=true
- in ontology table, set c_facttablecolumn=NAACCR_FACT.concept_cd
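The un-pivot / melt step in the Spark approach above turns one wide row per tumor into one narrow row per observed value. The sketch below shows the idea in plain Python rather than PySpark, so it runs without a Spark installation. The column names, the tumor_id key, and the sample rows are all invented for illustration.

```python
def melt(tumor_rows, id_column="tumor_id"):
    """Turn wide rows into (tumor_id, item, value) facts, skipping empty values."""
    facts = []
    for row in tumor_rows:
        for column, value in row.items():
            if column == id_column or value is None or value == "":
                continue  # the key column and blank fields produce no fact
            facts.append((row[id_column], column, value))
    return facts


wide = [
    {"tumor_id": 1, "primarySite": "C509", "grade": "2"},
    {"tumor_id": 2, "primarySite": "C000", "grade": ""},
]
for fact in melt(wide):
    print(fact)
```

Each resulting tuple corresponds roughly to one row of an i2b2 fact table, with the item and value later combined into a concept code.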
- HL7 FHIR Implementation Guide: Breast Cancer Data, Release 1 - US Realm (Draft for Comment 2)
- mCODE - Standard Health Record Collaborative HL7 FHIR Implementation Guide: minimal Common Oncology Data Elements (mCODE), v0.9.0, Version: 0.9.0 ; FHIR © Version: 1.0.2 ; Generated on Wed, Apr 17, 2019 12:01-0400.