This is a description of the measured values, intended for people analyzing the data. No medical knowledge is assumed, so the medical detail is kept shallow.
Currently this document describes the fake dataset generated by fake_data_grein, which contains fewer markers and comorbidities than the actual data will. The overall structure should stay the same. The document is based on consultations with clinicians but has not yet been checked by a clinician; all mistakes are my own.
The data include only hospitalized patients.
We gather both patient data, which correspond to the state of the patient upon admission to hospital plus some summary values, and disease progression data, which are measured repeatedly over the course of the hospitalization. The dataset is represented by a list that contains one element for each type of data.
The primary patient-centric measurements are the final outcome (discharged, deceased or continued hospitalization) and the breathing support the patient requires. The breathing support can be one of:
- AA: Ambient air, no support required
- Oxygen: supplemental oxygen by a nasal tube or a light mask
- NIPPV: Non-invasive positive pressure ventilation
- MV: Mechanical, invasive ventilation
- ECMO: Extracorporeal membrane oxygenation - the patient's blood is oxygenated outside of their body
These levels are strictly ordered by severity, from AA (least severe) to ECMO (most severe).
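Because the levels are ordered, comparisons such as "at least as severe as MV" are well defined. A minimal Python sketch of that ordering, assuming the level names are available as plain strings (the names `BREATHING_LEVELS`, `severity` and `is_invasive` are illustrative and not part of the dataset):

```python
# Breathing support levels from least to most severe, as listed above.
BREATHING_LEVELS = ["AA", "Oxygen", "NIPPV", "MV", "ECMO"]

def severity(level: str) -> int:
    """Return the rank of a breathing level (0 = ambient air)."""
    return BREATHING_LEVELS.index(level)

def is_invasive(level: str) -> bool:
    """MV and ECMO are the invasive forms of breathing support."""
    return severity(level) >= severity("MV")

# The most severe level among a few recorded days.
worst = max(["Oxygen", "AA", "NIPPV"], key=severity)
```

Here `worst` picks the most severe of the listed levels by rank rather than by alphabetical order.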
Patient data is stored in the patient_data list element. The columns are:
- patient_id: a unique ID of the patient (unique across all study sites)
- hospital_id: a unique string identifying a study site. Study sites will be pseudonymized to increase patient anonymity, so the string will not be interpretable, but the IDs will stay the same when more data arrives.
- age: age in years
- age_norm: normalized age
- sex: sex, M or F
- outcome: final outcome of the patient, one of Discharged, Hospitalized (still hospitalized at the date of data collection), Transferred (when transferred to a different hospital) and Death. For most purposes Transferred can be considered the same as Hospitalized.
- last_record: day for which we have the last record for the patient, i.e. the day to which the outcome column refers (0 is the day of hospital admission)
- days_from_symptom_onset: number of days between symptom onset and hospitalization. "Symptom onset" is defined as the day the patient or their carer subjectively first noticed any symptoms associated with Covid-19. It might not be available and is not a very reliable marker, but it is relevant for determining whether the patient was treated very early in their disease.
- admitted_for_covid: whether the patient was originally admitted in relation to a Covid diagnosis (for some patients Covid was discovered while treating something else)
- best_supportive_care_from: day when "best supportive care" was started. If the patient was determined to be too frail for some treatments (e.g. mechanical ventilation), this indicates the first day when treatment that would otherwise be chosen was avoided and best supportive care was initiated (0 is the day of hospital admission)
- discontinued_medication: whether any of the Covid medications was discontinued due to adverse events. Boolean.
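The remark that Transferred can usually be treated as Hospitalized can be applied as a small recoding step. A minimal Python sketch, assuming outcomes are available as plain strings; the helper `collapse_outcome` and the example rows are hypothetical:

```python
def collapse_outcome(outcome: str) -> str:
    """Map Transferred to Hospitalized; leave other outcomes unchanged."""
    return "Hospitalized" if outcome == "Transferred" else outcome

# Hypothetical rows using the patient_data column names described above.
patients = [
    {"patient_id": "p1", "outcome": "Discharged", "last_record": 9},
    {"patient_id": "p2", "outcome": "Transferred", "last_record": 4},
    {"patient_id": "p3", "outcome": "Death", "last_record": 12},
]
collapsed = [collapse_outcome(p["outcome"]) for p in patients]
```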
- BMI: the body mass index at hospital admission
- ischemic_heart_disease
- n_hypertension_drugs: the number of different anti-hypertensive drugs the patient uses regularly, as a rough measure of the severity of the hypertension. Integer; 0 means either not diagnosed or not treated for hypertension.
- has_hypertension_drugs: boolean, equivalent to n_hypertension_drugs > 0
- heart_failure: boolean
- COPD: boolean, chronic obstructive pulmonary disease
- asthma: boolean
- diabetes: boolean
- renal_disease: boolean
- liver_disease: boolean
- NYHA: New York Heart Association score for heart failure, if available or deducible from documentation ("NA" otherwise). The score has 4 levels:
  - 1: No limitation of physical activity. Ordinary physical activity does not cause undue fatigue, palpitation or dyspnea.
  - 2: Slight limitation of physical activity. Comfortable at rest. Ordinary physical activity results in fatigue, palpitation or dyspnea.
  - 3: Marked limitation of physical activity. Comfortable at rest. Less than ordinary activity causes fatigue, palpitation or dyspnea.
  - 4: Unable to carry on any physical activity without discomfort. Symptoms of heart failure at rest. If any physical activity is undertaken, discomfort increases.
- creatinin: concentration of creatinine in serum (μmol/L)
- pt_inr: prothrombin time (Quick test) as International Normalized Ratio
- albumin: concentration of albumin in serum/plasma (g/l)
- smoking: does the patient smoke? Boolean.
These quantities are derived from disease progression data and might be useful in analysis:
- high_creatinin: creatinin above 115 for males or above 97 for females
- high_pt_inr: PT INR above 1.2
- low_albumin: albumin below 36
- heart_problems: NYHA > 1
- obesity: BMI > 30
- worst_breathing: the worst breathing level recorded, or Death for deceased patients
- first_day_invasive, last_day_invasive: the first and last days the patient was recorded as having invasive breathing support (MV or ECMO). Note that if the patient is removed from invasive ventilation and then deteriorates once more, this range will include some days without invasive ventilation. NA if the patient never had invasive support.
- took_hcq/az/convalescent_plasma/antibiotics: took the given treatment at least once? Boolean
- any_IL_6/d_dimer: was IL-6/D-dimer ever measured? Boolean
- comorbidities_sum: number of all known comorbidities (NAs treated as not present)
- comorbidities_sum_na: number of all known comorbidities (NAs treated as half)
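Two of these derivations can be sketched in Python as follows. This is an illustration and not the code that produced the dataset. It assumes that NA is represented as `None`, that "treated as half" means each NA adds 0.5 to the sum, and that the caller decides which columns count as comorbidities. The function names are hypothetical.

```python
from typing import Optional

def high_creatinin(creatinin: Optional[float], sex: str) -> Optional[bool]:
    """Creatinine above 115 (males) or 97 (females); None if not measured."""
    if creatinin is None:
        return None
    return creatinin > (115 if sex == "M" else 97)

def comorbidity_sums(flags: list) -> tuple:
    """Return (comorbidities_sum, comorbidities_sum_na) for a list of
    True/False/None flags, where None stands for NA."""
    present = sum(1 for f in flags if f is True)
    unknown = sum(1 for f in flags if f is None)
    # Assumption: each NA contributes one half to the second sum.
    return present, present + 0.5 * unknown
```

For example, `comorbidity_sums([True, None, False, True])` counts two known comorbidities and one unknown.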
The most important part of the disease progression data is the breathing data, which contain the breathing support used on each day. These data should have no gaps and should cover the whole hospitalization period.
Breathing data is stored in the breathing_data list element. The columns are:
- patient: ID of the patient, matching patient_data
- day: day of hospitalization (starting with 0 - first day at hospital)
- breathing: an ordered factor representing the breathing level as described above, including Death and Discharged as levels.
Note that day can in some cases be negative when some data is available before hospitalization (this would almost certainly be only PCR test results).
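The "no gaps" expectation and the invasive-support range can be checked per patient. A minimal Python sketch, assuming one patient's breathing rows are available as (day, breathing) pairs; `has_gaps`, `invasive_range` and the example records are hypothetical:

```python
def has_gaps(days: list) -> bool:
    """True if the days are not a run of consecutive integers.
    Duplicate days are flagged as well."""
    if not days:
        return False
    ordered = sorted(days)
    return ordered != list(range(ordered[0], ordered[-1] + 1))

def invasive_range(records: list) -> tuple:
    """First and last day with MV or ECMO; (None, None) if never invasive.
    `records` is a list of (day, breathing) pairs for a single patient."""
    invasive_days = [day for day, level in records if level in ("MV", "ECMO")]
    if not invasive_days:
        return None, None
    return min(invasive_days), max(invasive_days)

# Hypothetical patient who leaves and re-enters mechanical ventilation.
records = [(0, "Oxygen"), (1, "MV"), (2, "NIPPV"), (3, "MV"), (4, "Discharged")]
```

With these records the invasive range spans days 1 to 3 even though day 2 was non-invasive, which mirrors the caveat about first_day_invasive and last_day_invasive.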
Finally, we collect a number of clinical markers, the most important of which are the drugs the patient used. Those are available in both long and wide formats (as marker_data and marker_data_wide). Markers are not measured every day and can be systematically missing for a whole site. The frequency of measurement can differ between markers.
In the long format, the columns are:
- patient: ID of the patient
- day: day of hospitalization (starting with 0)
- marker: the name of the marker and/or drug taken
- value: double value of the marker
- censored: a character string indicating whether the marker observation was censored. One of left, right and none
In the wide format there is a column for each marker and, for those that can be censored, an additional xx_censored column.
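The relation between the two formats can be illustrated with a small pivot. This Python sketch is not how marker_data_wide is actually built. For simplicity it writes a `_censored` column for every marker, whereas the real wide format has one only for markers that can be censored. The function `to_wide` and the example values are hypothetical.

```python
from collections import defaultdict

def to_wide(long_rows: list) -> list:
    """Pivot long-format rows into one dict per (patient, day)."""
    wide = defaultdict(dict)
    for row in long_rows:
        cell = wide[(row["patient"], row["day"])]
        cell[row["marker"]] = row["value"]
        cell[row["marker"] + "_censored"] = row["censored"]
    return [
        {"patient": patient, "day": day, **cells}
        for (patient, day), cells in sorted(wide.items())
    ]

long_rows = [
    {"patient": "p1", "day": 0, "marker": "crp", "value": 3.0, "censored": "left"},
    {"patient": "p1", "day": 0, "marker": "d_dimer", "value": 0.7, "censored": "none"},
    {"patient": "p1", "day": 2, "marker": "crp", "value": 85.0, "censored": "none"},
]
wide_rows = to_wide(long_rows)
```

A marker that was not measured on a given day is simply absent from that day's row, which corresponds to a missing value in the wide table.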
The markers are:
- pcr_positive: whether the patient's PCR test for virus presence was positive
- pcr_value: if available, the Ct number of the PCR test. This is a rough indication of the viral load present: the higher the Ct, the less virus was found. A Ct number > 35 is (mostly) considered a negative test. The concentration of viral RNA in the sample needs to increase roughly two-fold to make the Ct number drop by 1.
- oxygen_flow: when the patient is receiving supplemental oxygen (the Oxygen breathing support), this records how much oxygen they receive in liters/minute. Unfortunately, this can't be interpreted too strongly as a measure of severity, because the level of blood oxygenation achieved with the given flow must also be considered (this is not yet included in the simulated dataset but will be added).
- crp: the C-reactive protein concentration in blood in mg/l. This is a non-specific marker of inflammation and lags the actual inflammation by 1-2 days. < 3 is usually considered normal, > 30 indicates noticeable inflammation, viral pneumonias are associated with CRP around 50-100, bacterial pneumonia with CRP roughly 100-200 (bacterial superinfection is possible in Covid patients), and CRP > 200 is associated with sepsis. The main advantage of CRP is that it changes by several orders of magnitude, far more than the measurement noise. Low levels can be censored, but that is unlikely to be a major issue.
- d_dimer: the concentration of D-dimer in blood, which indicates the amount of blood coagulation happening and can be a marker of complications (thrombosis, inflammation). D-dimer levels react quite quickly to changes in the patient's state. The normal level changes with age from roughly 0.5 for young healthy persons to around 0.8 for older patients. Values > 1 are generally considered pathological.
- ferritin: TODO, healthy patients have around 450. High levels can be censored.
For markers, missing values indicate the marker was not measured on that day.
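The two-fold rule for pcr_value implies that viral RNA concentration scales roughly as a power of two in the Ct difference. A minimal Python sketch under the idealized assumption of an exact doubling per Ct unit; the function name and the choice of 35 as default reference are illustrative only:

```python
def relative_viral_load(ct: float, ct_reference: float = 35.0) -> float:
    """Approximate fold difference in viral RNA relative to a sample with
    Ct == ct_reference, assuming an exact doubling per Ct unit."""
    return 2.0 ** (ct_reference - ct)
```

Under this assumption a sample with Ct 25 contains about 2^10 = 1024 times more viral RNA than one with Ct 35, which is why analyses often work with Ct directly as a log-scale quantity.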
The drugs are:
- Compounds with suspected activity against the virus itself:
  - hcq: Hydroxychloroquine
  - az: Azithromycin - usually administered in combination with HCQ
  - kaletra: Kaletra (Lopinavir/Ritonavir)
- tocilizumab: which is suspected to alleviate the immune reaction to the virus and shorten the severe phase of the disease
For drugs, the values indicate the dose. Missing values indicate the patient didn't take the drug on the given day.
It probably doesn't make much sense to distinguish different dosing regimes of the drugs (there won't be enough data). Also, the effect of a drug should last longer than the days on which it was taken. This is especially true for HCQ, which is removed from the body only very slowly and can stay at therapeutic concentrations for quite a long time after the patient stopped taking it. For this reason it probably makes sense to distinguish only "days before taking the drug" and "days after taking the drug for the first time".
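The before/after split can be expressed as a per-day indicator that switches on at the first recorded dose. A minimal Python sketch, assuming one patient's rows for one drug are available as (day, dose) pairs with `None` for missing values, and assuming the day of the first dose itself counts as "after". The helper names and the dose values are hypothetical.

```python
def first_dose_day(drug_rows: list):
    """Earliest day with a recorded (non-missing) dose, or None if never taken.
    `drug_rows` is a list of (day, dose) pairs for one patient and one drug."""
    days = [day for day, dose in drug_rows if dose is not None]
    return min(days) if days else None

def ever_treated_by_day(day: int, first_day) -> bool:
    """True from the first dose onward, False before it or if never treated."""
    return first_day is not None and day >= first_day

# Hypothetical HCQ record: doses on days 2 and 3 only.
hcq = [(0, None), (1, None), (2, 400.0), (3, 400.0), (4, None)]
first = first_dose_day(hcq)
indicator = [ever_treated_by_day(day, first) for day, _ in hcq]
```

The indicator stays on for day 4 even though no dose was taken that day, reflecting the assumption that the drug effect outlasts the dosing days.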