
Clean Data Export #36

Open
lukeloken opened this issue Oct 22, 2015 · 43 comments

@lukeloken

Is it possible to create a 'clean' function that removes the flagged observations?

Suggestions:
Add an argument "which" that can either be "ALL" or a list of rules with the default being ALL.
Add an argument that simply deletes the whole row.

clean(flaggedobject, which=ALL, replace=NA)
clean(flaggedobject, which=c(1,2,4), replace=99999)
clean(flaggedobject, which=ALL, deleterow=TRUE)

@jordansread

Cool, thanks @lukeloken. Great ideas.

@jordansread

the one change I might suggest is:

clean(flaggedobject, which='ALL', replace=NA) # to replace w/ NA
clean(flaggedobject, which='ALL', replace=NULL) # to delete row

and the ability to do this:

clean(object, which, ..., replace) #as the function arguments, which would support this:
clean(object, 'x > 9999', 'persist(x) > 10', 'MAD(x) > 3', replace=NA)

Thoughts @lukeloken and @aappling-usgs ?

@lukeloken

I like the replace=NULL option.

Do you want the ability to 'clean' an unflagged object?

clean(object, 'x > 9999', 'persist(x) > 10', replace=NA)

Or is it more straightforward to run two commands?

flagged <- flag(object, 'x > 9999', 'persist(x) > 10')
cleaned <- clean(flagged, which=ALL, replace=NA) 

@jordansread

I like that option personally, because then I can use sensorQC to really rapidly clean a vector or data.frame with one line. The key is having it be a good thing for power users without getting in the way of a more normal workflow.

@lukeloken

I like having both options, but I don't know how difficult this is to code.

@jordansread

oh yeah, for sure, I was thinking both.

@lukeloken

I have another question for you, Jordan: how would you recommend removing a single observation that is clearly an outlier when there are also NA values? I think the MAD function does not work properly if there are NA values in the series.

  dates <- seq(as.POSIXct('1999-01-01'),by=1,length.out=16)
  values1 <- c(runif(10,2,4),10000, runif(5,2,4))
  values2 <- c(runif(10,2,4),10000, NA, runif(4,2,4))
  data.in1 <- data.frame("DateTime"=dates,"sensor.obs"=values1)
  data.in2 <- data.frame("DateTime"=dates,"sensor.obs"=values2)
  flag(data.in1, "is.na(x)", 'MAD(x) > 3')
  flag(data.in2, "is.na(x)", 'MAD(x) > 3')

The two {flag} commands do not treat the 10000 observation the same.

@jordansread

whoops, that is a bug. Good catch! #37

Also note that you can do:

values1 <- c(runif(10,2,4),10000, runif(5,2,4))
values2 <- c(runif(10,2,4),10000, NA, runif(4,2,4))
flag(values1, "is.na(x)", 'MAD(x) > 3')
flag(values2, "is.na(x)", 'MAD(x) > 3')

for really simple quick qaqc

jordansread pushed a commit to jordansread/sensorQC that referenced this issue Oct 22, 2015
@jordansread

@lukeloken you should be able to try this after doing another devtools::install_github('usgs-r/sensorQC')

check out the readme for the pattern: https://github.com/USGS-R/sensorQC/blob/master/README.md

Can you let me know if that is what you expected? (Note that I changed 'ALL' to 'all' as an argument, just to be consistent with other packages it is used in.)

@lukeloken

Jordan, I think this works great. The only comment I have is in regard to the output of {clean}. Is there any reason to keep the structure of this as a list rather than unlisting it to a data.frame class? This would save a line of code after processing.

@jordansread

I think I need to think on that...
In the meantime, clean(bla bla)$sensor should get you a data.frame, right?

@lukeloken

No rush, I was just curious if there was a reason to keep the object class. ...$sensor and [[1]] get to the data.frame.

One reason to keep it as is: it will only print the first 15 observations, which is very handy.

@jordansread

yeah, another little trick: you can use

print(sensor, 20)

to print any number of rows. 15 is the default.

In the future, I want the sensor w/o any flags (so the thing you get back from clean) to behave just like a data.frame, other than the print method. Cool?
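
Roughly, the idea is something like this (a hypothetical S3 sketch with made-up names, not sensorQC's actual implementation):

# A class that behaves like a data.frame everywhere except print(),
# which shows only the first n rows by default.
print.sensor_df <- function(x, n = 15, ...) {
  print.data.frame(head(x, n), ...)
  if (nrow(x) > n) cat('...', nrow(x) - n, 'more rows\n')
}

df <- data.frame(times = Sys.time() + 1:100, x = rnorm(100))
class(df) <- c('sensor_df', 'data.frame')
df             # prints the first 15 rows
print(df, 30)  # or any number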

@lukeloken

Sounds good. That will make it easiest to merge multiple 'cleaned' objects stemming from the same original data file.

@lukeloken

Thanks Jordan. The next step for me will be building individual rules for each parameter. Essentially, I'd like to build a table of parameters and rules

variable, rule1, rule2, etc.
pH, 'x > 10', 'x < 4', 'MAD(x) > 3',
CO2, 'x < 0', 'x > 4000', 'MAD(x) > 2',
var3, etc

and then run a loop to apply each set of rules to the appropriate parameter

      file<-read(...)
      for (col in 2:ncol(file)){
        data<-file[,c(1,col)]
        rules <- unlist(ruletable[ruletable[, 1] == names(data)[2], -1])  # look up the rules in the table above
        newdata<-clean(data, rules, replace=NA)
        if (col == 2){
          finaldata<-newdata}
        if (col>2){
          finaldata<-merge(finaldata, newdata, by="times", all=TRUE)}
      }

If you have any suggestions, I'm all ears.

@jordansread

looks fine to me. Probably need

newdata <- clean(data, rules, replace=NA)$sensor %>%
  setNames(c('times', variable))

though to get the merge to work. Otherwise all of the data are going to have the same variable names ('x').

This is probably a more common use case than I imagined it would be originally, so let me know what the pain-points are if you have any.

@lukeloken

Hey Jordan,

I think I'm getting somewhere with the automatic code. I may have found another bug in the {clean} function: it does not appear to accept data.frame objects.

dates <- seq(as.POSIXct('1999-01-01'),by=1,length.out=500)
values1 <- c(runif(100,-5,12),10000, runif(397,2,100), NA, -9)
file <- data.frame("DateTime"=dates,"sensor.obs1"=values1)
clean(file, 'x > 9999', 'persist(x) > 10', replace=NA)

Is there a quick way to convert a data.frame into a sensor object?
Otherwise I may be better off running vectors through {clean} and simply binding them together. However, this makes me somewhat nervous in case the vectors are different lengths.

#Make Dataset
dates <- seq(as.POSIXct('1999-01-01'),by=1,length.out=500)
values1 <- c(runif(100,-5,12),10000, runif(397,2,100), NA, -9)
values2 <- c(runif(397,2,18), NA, -9, runif(100,-2,36),10000)
file <- data.frame("DateTime"=dates,"sensor.obs1"=values1, "sensor.obs2"=values2)

#Set rules
Variables=names(file)[-1]
Rule1<-c('x < 0', 'x < 0')
Rule2<-c('MAD(x) > 3', 'MAD(x) > 2')
RuleTable<-data.frame(Variables, Rule1, Rule2, stringsAsFactors=FALSE)

for (col in 2:ncol(file)){
  data<-file[,c(1,col)]
  name<-names(data)[2]
  rules<- unlist(RuleTable[which(RuleTable[,1]==name), 2:3])
  flagged<-flag(data, rules)
  newdata<-clean(flagged, replace=NA)$sensor
  # newdata<-clean(data, rules, replace=NA)$sensor
  names(newdata)<-c('times',name)

  if (col == 2){
    finaldata<-newdata}
  if (col>2){
    finaldata<-merge(finaldata, newdata, by="times", all=TRUE)}
}

@jordansread

whoops, @lukeloken I hadn't created the method for clean that accepts a data.frame yet. Expect that in the near(ish) future.

For now, the simple way to go from data.frame to sensor is:

sensor(data.frame(c(1),c(1)))

@jordansread

As soon as this #40 is merged, you can do clean(data.frame...). I would suggest keeping them as data.frames so you don't lose time, as that is what you are going to join by.

@lukeloken

Hey Jordan,

Thanks for plugging away at sensorQC. I think we are getting somewhere and I can see the future.

Quick question about MAD: how exactly does the code work? I know what median absolute deviation is, but how is each observation compared to this value?

Hope you have a good weekend.

Luke


@lukeloken

Never mind, I think I found the correct code.

abs(x - median(x)) / mad(x)
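
A quick sanity check of that statistic on a toy series like the one from earlier in this thread:

x <- c(runif(10, 2, 4), 10000, runif(5, 2, 4))
score <- abs(x - median(x)) / mad(x)
score[11]         # enormous: the 10000 spike
which(score > 3)  # typically just 11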

@lukeloken

[Figure: pH from Lake Mendota. Red = raw data, black = 'cleaned' data after running MAD(x) > 3]

So I have been running the MAD and some basic min/max expressions to remove outliers. In this example, the MAD function may not be appropriate, as it excludes values collected when we sampled for extended periods of time in different areas of the lake (see above). This occurs because we drove our boat into Cherokee Marsh, where the pH was drastically different from the rest of the lake.

I'm thinking of additional rules to apply in the {flag} and {clean} functions. One method that comes to mind is applying the MAD function or other statistical tests over a rolling window, something similar to {runmean} in the 'caTools' package. Any thoughts on approaches to using a local outlier test rather than a global one? I realize this will increase the processing time, but it may be more appropriate for data strings that oscillate among multiple values. I could also see this being valid for stationary timeseries (e.g., a rain event triggers high NO3 and discharge for several days).
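
For example, a local score could be assembled from caTools' running functions (a rough sketch, not sensorQC code; localMADscore and the pH column are hypothetical names, and caTools' running functions don't skip NAs, so those would need handling first):

library(caTools)

# Local (rolling-window) MAD score: the distance of each point from the
# running median, scaled by the running MAD. k is an odd window width.
localMADscore <- function(x, k = 41) {
  ctr <- runquantile(x, k, probs = 0.5)  # running median
  scl <- runmad(x, k, center = ctr)      # running MAD (includes the 1.4826 constant)
  abs(x - ctr) / scl
}

# e.g., flag points more than 3 local MADs from the running median:
# outliers <- which(localMADscore(datatable$pH, k = 41) > 3)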

@jordansread

yeah, MAD should be used with a window if the data aren't expected to be stationary, so pretty much always.

You can use MAD with a window as MAD(x,w) > 3, but you have to window the data first. There is an example of that in the readme, but I don't think I have implemented the manual windowing yet, which is probably what you want. It is another column in the sensor called 'w'.
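
Roughly, the auto-windowed pattern looks like this (a sketch reusing the data.in1 example from earlier in this thread):

so <- sensor(data.in1)           # build the sensor object
sow <- window(so, type = 'auto') # adds the 'w' column of window indices
flag(sow, 'MAD(x,w) > 3')        # MAD computed within each window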

@lukeloken

Hey Jordan,

I've been able to get the window function to work, but only with type='auto' and after generating a sensor object. It would be great to be able to 1) window a data.frame object and 2) specify the window width (e.g., type = 10-50 measurements or seconds). Hopefully we can pass this windowed object through {clean}/{flag} with 'MAD(w)>3'.

I'm curious how your window procedure works. Specifically, does it look in both directions, and how does it handle the first/last observations?

# col = column number of data values (2:ncol(datatable))
data <- datatable[, c(1, col)]
sensored <- sensor(data)
windowed <- window(sensored, type=10)
cleanedframe[, col] <- clean(windowed, 'MAD(w)>3', replace=NA)$sensor[, 2]

@jordansread

@lukeloken window currently just windows data based on breaks in time, which is the auto argument. Manual is supposed to take a time argument for window spacing (this would be more appropriate for continuous data).

Another option should be to pass in a vector of window indices. I don't know if you looked at the sensor$sensor$w vector, but it is just a vector of indices that are used to group observations. Passing in your own w vector should be supported; that way you can easily customize your own windows (maybe you all with FLAMe would define them more by space than by time...).

Which is the priority to implement first: manual windowing based on time, or adding your own w vector? Or other ideas here?

@lukeloken

@jread-usgs I see what your current window function does. It is designed for burst data in which you capture 30 seconds of 1 Hz data every 30 minutes. This obviously does not work for FLAMe data, which are collected continuously at 1 Hz.

I would vote for building customized exclusion vectors. I've started a function below to compute a rolling MAD, but you probably have a faster/better way to do it. This is computationally heavy, but it seems like a good place to start (at least for my type of data). If the output of this function became the w slot, we could use 'w > 3' to flag.

One problem I've already encountered is with lots of observations of the same value. For example, in many of the windows for pH, more than half the observations are identical, making mad = 0 and the MAD statistic Inf. We certainly don't want to exclude observations for this reason.
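
A two-line illustration of that degenerate case:

interval <- c(rep(7.10, 12), 7.11, 7.09, 7.12)
mad(interval)                                 # 0, since more than half the window is identical
abs(7.12 - median(interval)) / mad(interval)  # Inf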

x <- data[, 2]  # vector of x values
window <- 20    # size of window in both directions to compute MAD


# Rolling MAD function. Windows subsequent observations and computes the MAD statistic from local observations

rollingMAD <- function(x, window) {

  # Empty vector for computed MAD values
  WindowedMAD <- rep(NA, length(x))

  for (obs in 1:length(x)) {
    # Set interval of observations used to compute MAD;
    # for observations near the start and end, only include real observations
    idx <- (obs - window):(obs + window)
    idx <- idx[idx >= 1 & idx <= length(x)]
    interval <- x[idx]

    # Compute MAD
    localmedian <- median(interval, na.rm=TRUE)
    localmad <- mad(interval, na.rm=TRUE)
    WindowedMAD[obs] <- abs(x[obs] - localmedian) / localmad
  }

  return(WindowedMAD)
}

We might also think about including some of the running sd code from caTools (page 35 of https://cran.r-project.org/web/packages/caTools/caTools.pdf).

@lukeloken

Made some progress on the rollingMAD function and cleaning a FLAMe dataframe. {clean(x, 'w>3')} will work after loading the local MAD values into the 'w' slot.

  sensored$sensor$w <- rollingMAD(sensored$sensor$x, window)
  cleaned <- clean(sensored, rules, replace=NA)$sensor[,2] 

No rush on looking at these, as it seems to be working at the moment. In the future, I'd be interested in how to increase processing speed.

# ==========================================================
# Rolling MAD function. 
# Windows subsequent observations and computes local MAD statistic
# ==========================================================

rollingMAD <- function(x, window) {

  # Create empty vector for computed MAD values
  WindowedMAD <- as.numeric(rep(NA, length(x)))

  for (obs in 1:length(x)) {
    # Set interval of observations;
    # truncate the interval for observations near the start and end
    idx <- (obs - window):(obs + window)
    idx <- idx[idx >= 1 & idx <= length(x)]
    interval <- x[idx]

    # Compute MAD
    localmedian <- median(interval, na.rm=TRUE)
    localmad <- mad(interval, na.rm=TRUE)
    WindowedMAD[obs] <- abs(x[obs] - localmedian) / localmad
  }

  # Replace infinite values (mad == 0) with NA
  WindowedMAD[is.infinite(WindowedMAD)] <- NA

  # Return vector of windowed MAD values (same length as x)
  return(WindowedMAD)
}

# ==========================================================
# Function to clean multiple parameters of a dataframe
# Col1 = times; col2 and up == variables
# Requires a data.frame of rules that match datatable column names
# window = number of observations to include in both directions for rollingMAD
# ==========================================================

sensorclean <- function(datatable, ruletable, window) {

  # Create empty datatable to fill with cleaned values.
  # Keep only column 1 (times)
  cleanedframe <- datatable
  cleanedframe[, -1] <- NA

  # Clean all variables in datatable starting with column 2
  for (col in 2:ncol(datatable)) {

    # Create sensor object and fill 'w' slot with rollingMAD
    data <- datatable[, c(1, col)]
    sensored <- sensor(data)
    sensored$sensor$w <- rollingMAD(sensored$sensor$x, window)

    # Set rules manually or from ruletable, then clean the sensor object
    # rules <- c('w>3', 'is.na(x)')
    rules <- unlist(ruletable[which(ruletable[, 1] == names(data)[2]), -1])
    if (length(rules) == 0) {
      cleanedframe[, col] <- data[, 2]
    } else {
      cleanedframe[, col] <- clean(sensored, rules, replace=NA)$sensor[, 2]

      # Plot flagged (red) and retained (black) observations
      flags <- data[, 2]
      flags[which(flags == cleanedframe[, col])] <- NA
      plot(cleanedframe[, col], col="black", ylab=names(data)[2], xlab="time",
           ylim=range(data[, 2], na.rm=TRUE))
      points(flags, col="red", pch=16)
    }
  }

  return(cleanedframe)
}

# =====================================================
# Workflow to clean FLAMe data after loading above functions
# =====================================================

# setwd("C:/Users/Luke/Dropbox/FLAME/Data/2015-08-24_SparklingLake")
datatable<-read.csv("SparklingLake2015-08-24.csv", header=TRUE, stringsAsFactors=FALSE)
datatable$ltime<-as.POSIXct(datatable$ltime, format="%Y-%m-%d %H:%M:%S", tz="UTC")
ruletable<-read.csv("C:/Users/Luke/Dropbox/FLAME/SensorQC/SensorQCRules_2015.csv", header=TRUE, stringsAsFactors=FALSE)
window <- 15 # size of window in both directions to compute MAD

CleanFLAME<-sensorclean(datatable, ruletable, window)

Example output from {sensorclean} (red = removed values).
Not perfect, but it's a start.

[figure]

@jordansread

Nice! Looking good. Thanks for taking a stab at this. Happy to see if I can help with the speed-up.

@jordansread

@lukeloken how slow is this for you? Can you give me an indication of the processing time it takes for a reasonably sized dataset so I can see where the starting point is?

@lukeloken

On my Core i5 with 6 GB RAM: the dimensions of datatable are 30210 by 47, which is one of our larger data files. Is there somewhere online I can drop my ruletable and datatable so you can play with my code?

It takes about 14s to perform {rollingMAD} on one column of data.

sensored$sensor$w<-rollingMAD(sensored$sensor$x, Medianwindow, MADwindow)

To loop through all 46 columns and replace flagged values with NA using {sensorclean} takes about 5 minutes. This can obviously be shortened by not plotting, but for a first pass it seems useful to see what is removed.

CleanFLAME<-sensorclean(datatable, ruletable)

Hilary just suggested looking at the {tsoutliers} package.

library(tsoutliers)
fit <- arima(data[,2], order = c(1, 1, 0))
resid <- residuals(fit)
pars <- coefs2poly(fit)
outliers <- locate.outliers(resid, pars,types = c('AO'),cval=25)

plot(data[,1],data[,2],cex=0.5,pch=19,col='red', ylab=names(data)[2], xlab="")
points(data[outliers$ind,1],data[outliers$ind,2],cex=0.5,pch=19,col='black')

Also a little slow, but might be another method for assigning flags.

@jordansread

cool, I hadn't seen tsoutliers, looks like a good package. I am going to take a closer look at that.

I am trying out your code/data now.

@jordansread

harder than I thought to make this fast. I was trying dplyr::lead() or lag() before going into MAD, but not having much luck with that (although it would be really fast if it worked).

@jordansread

RcppRoll::roll_median(datatable$ChlARFU, 30) is a fast potential solution. Looking into how flexible that is.
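
For instance, a vectorized approximation of the rolling MAD score could be assembled from roll_median alone (a sketch with a hypothetical name; it centers the deviations point-wise rather than per window, so it differs slightly from the loop version near window edges):

library(RcppRoll)

rollingMADfast <- function(x, n = 41) {
  med <- roll_median(x, n, fill = NA, align = "center")  # running median
  dev <- abs(x - med)                                    # point-wise deviations
  scl <- 1.4826 * roll_median(dev, n, fill = NA, align = "center")
  dev / scl
}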

@jordansread

RcppRoll::rollit_raw lets you roll your own.

@lukeloken

And I love rolling it raw!

@jordansread

@lukeloken I will have something for you to test (if you are up for it) shortly.

@jordansread

1.5 hrs...

@lukeloken

Happy to be a guinea pig

@jordansread

Give it a shot? #42

you can devtools::install_github from my fork (jread-usgs) and get this feature. See the readme here for an example of using the rolling window. It is centered by default right now.

@lukeloken

Hey Jordan,

The code works and is much faster than my previous version. Is there any way I can see the code used to calculate {MAD.roller}?

A few issues: I can tell that if MAD == Inf, the value becomes flagged. Also, I'm finding that the threshold of 3 does not work well for a lot of the data. I've increased the threshold to 10, and it does a reasonable job for some parameters but a poor one for others (see figures).

[Two figures: flagged (red) vs. retained (black) observations for two parameters at a threshold of 10]

I'm now starting to wonder how useful the MAD function is for moving data. Perhaps other outlier detection methods are needed.

Code to plot and assess rolling window outlier detection

for (col in 2:ncol(datatable)) {

  so <- sensor(datatable[, c(1, col)])  # 'so' avoids masking the sensor() constructor

  so2 <- window(so, n=40, type='rolling')
  flagged <- flag(so2, 'x == 999999', 'MAD(x,w) > 10', 'MAD(x) > 3')

  outliers <- flagged$flags[[2]]$flag.i
  flagname <- flagged$flags[[2]]$expression

  # plot(so2)
  par(mar=c(4, 4, 1, 1))
  plot(so2$sensor$times, so2$sensor$x, col="black", type="p", pch=1, cex=1,
       main=names(datatable)[col], ylab=names(datatable)[col], xlab="")
  points(so2$sensor[outliers, ], col="red", pch=16, cex=1)
  legend("topleft", c(flagname), col="red", pch=16, cex=1, bty="n")

}

@jordansread

wow those are some pretty figs 😉

Code for MAD.roller is here

I put that warning in there because I haven't really tested it w/ different things (like a bunch of NAs or Inf).

Do you think that rolling window is big enough? I was having good luck with much longer periods - like 100 or 300, but that is for station data.

There is also an issue here when data have trends. When doing windows in the past, I have detrended and then operated on the residuals. It is another computational cost, but it helps if you really just have a trend in the window that creates a wider spread in the data.
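
Per window, the idea is something like this (an illustrative sketch with a hypothetical helper name, not what sensorQC does):

# Detrend a window with a linear fit, then score the residuals by MAD,
# so a steady rise or fall doesn't inflate the local spread.
detrendedMAD <- function(interval) {
  t <- seq_along(interval)
  r <- residuals(lm(interval ~ t, na.action = na.exclude))  # NA-padded residuals
  abs(r) / mad(r, na.rm = TRUE)
}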

@lukeloken

I've been using a handful of window sizes and thought smaller would be better. This allows quick peaks to remain unflagged. Otherwise the tops of the mountains get cut off because the median remains on the valley floor.

I think that detrending could be useful. The first step in the tsoutliers package is building an ARIMA model and then using the residuals as an indicator of outliers.

library(tsoutliers)
fit <- arima(data[,2], order = c(1, 1, 0))
resid <- residuals(fit)
pars <- coefs2poly(fit)
outliers <- locate.outliers(resid, pars,types = c('AO'),cval=25)

plot(data[,1],data[,2],cex=0.5,pch=19,col='red', ylab=names(data)[2], xlab="")
points(data[outliers$ind,1],data[outliers$ind,2],cex=0.5,pch=19,col='black')

I haven't used these methods before, but I am interested in digging into them if they seem like a good idea.
