ATTINY85 combining I2C and Pin Change Interrupts - possible race condition? #774
Replies: 6 comments
-
A race condition caused by an interrupt during another ISR is a can't happen - this is a classic AVR, the only time that interrupt priority is used is to decide which interrupt to run IFF more than 1 enabled interrupt flag is set while interrupts are enabled (and thus outside of an interrupt), so it needs to choose which one to run. But the I flag is cleared by hardware when entering an ISR and set upon leaving it, so we can't have an interrupt in the middle of the USI ISR because unless we specifically enable interrupts they're disabled while in an ISR. (Enabling them in an interrupt opens up a jumbo can of worms, we do not support this mode of operation and all behavior of the core is undefined - it might work, it might not, it is not tested, and should not be used) Make sure you guard writes to multibyte variables or anything that could cause a race condition when (ex, even if it's only 8 bits in size, if your code assumes that no interrupt could fire between when it read one register and wrote a different one, that could break unless the code only executes inside an interrupt). Basically anything that is read from an interrupt and written from outside of one needs to guard the writes by disabling interrupts, and anything that is written from an interrupt and read from outside must be declared volatile, and if it is written in both contexts and read in one or both contexts, both guards are needed. But that should ensure that you don't get garbage values from it, This is not like modern AVRs where there's a lvl 1 interrupt (optionally), in addition to the normal lvl 0 interrupts, and the level 1 interrupt can interrupt the other interrupts. When I combine an AVR an an Espressif MCU, I give each of them an I/O line to the other's reset pin, so if either of them hangs (though usually it's only the ESP that has problems |
Beta Was this translation helpful? Give feedback.
-
Hi Spence, Thanks for the insights, much appreciated. I need to do some deeper digging into the Wire library, but what happens when the initial USI_START interrupt is serviced - does it exit the ISR and re-enable global interrupts while it waits for further bytes to come in, or does it leave interrupts disabled all the way until my onRequest function is complete? If the former, isn't there an opportunity of a PCINT0 interrupt to come in and 'get in front' of the expected flow of the I2C transaction, thus potentially messing up the messaging between slave and master? I had considered adding an extra IO from the ESP to '85 to hard reset the chip if it went AWOL, but this thing is already built on a PCB. I may yet do that if I can't get to the bottom of the lockups (revision 2 of the design I guess). |
Beta Was this translation helpful? Give feedback.
-
I've done a little reading into the Wire library code and I now believe that it is possible for a PCINT0 to 'jump the queue' while Wire is servicing an I2C transaction. Note that this isn't a fault of Wire, it's just my particular use case with my pulse reader application. From my understanding (please correct me if I'm wrong), the USI generates multiple interrupts for the various states of the I2C transaction. In my particular case where the master requests 2 bytes from slave address 0x04:
At the completion of each of the 8 steps above interrupts will be enabled on exit from the ISR. If a pin change interrupt occurs at any point in the above process would it not be serviced as soon as the preceding USI interrupt is exited? Unless I've missed something, this could have the potential to disrupt the I2C data transaction and maybe leave the slave I2C bus in an unknown state (corrupt data, hung bus etc)?
Perhaps 'race condition' was the wrong term to use (I was already aware that interrupting an interrupt wasn't possible on the AVRs). But I wonder if in my particular case the multi-staged interrupt behaviour of the Wire library is being interrupted in between interrupts. If that is the case, is there an elegant way to ensure the I2C data transaction is protected from interruptions from pin changes so that it has the chance to respond to the master in its entirety (or at least die trying)? I can tolerate a pulse missed here or there; locking up the I2C databus and having to hit the panic button on the ATTiny85 less so. Using the list above, it would be ideal to be able to 'reach in' to the library and detect when steps 1 and 8 occur so I can disable and re-enable pin change interrupts. |
Beta Was this translation helpful? Give feedback.
-
How many of those are actually activated in our implementation though? Interrupt How do you ensure that you do not put the AVR into sleep mode during a request? The Wire library used on pre-revolutionary AVRs is completely different under the hood from post-revolutionary AVRs that do, with much effort, finally work properly with sleep modes. While the post-revolutionary parts present the same API (well, we had to add a few extensions to the API, some to expose new features on modern AVRs plus I think 2-3 to fill holes in the stock API. I think 2-3 were added because the stock arduino API is insufficient to make quality I2C slave firmware; at least one of those extensions was added (then found to be only a partial fix, requiring yet another rewrite of that functionality) for the express purpose of making it possible to reliably enter sleep on slave without interfering with any transaction. The solution is still less than one would hope for - you have to spam sleep until you are able to put the part to sleep (it does idle sleep during transfers, so the end of the xfer wakes it up, then immediately goes back into idle sleep mode. Repeat until it's done sending and restores the sleep mode you wanted originally, and uses that. I don't think that's a great solution, but I think it's the the best that can be done), and that's on a modern AVR's TWI module, which is head and shoulders above the classic one. And head, shoulders, arms torso, and thighs above the Unhelpful serial interface on the classic tinies. I suspect if you did not put the t85 to sleep, you would find that the problem did not occur. This is I think the most likely cause of the problem and hence should be your first test. You must not put the device to sleep while in the middle of a transaction. I am unaware of any way to detect that on the classic tinyAVR USI, you need to use other information available to your code and to yourself to regulate when you put the chip to sleep and ensure that it won't be in a transfer, based on your knowledge of how the master operates. If this is the problem, it is a "Won't Fix" on ATTinyCore. Only the post revolutionary AVRs have that working, and I don't believe any classic AVR has ever had a Wire library that had an interlock against entering sleep during a transfer. The fix was was implemented for modern AVRs by someone who is a very talented AVR programmer (and unlike me, he understands the compiler toolchain - He points me to files and line numbers in avr-gcc source when I ask questions about things, as if I'm supposed to be able to understand it, makes it sound easy. Man, I can't even recognize the programming language the code was written in (it's certainly not any language I've ever written or seen before - hell I can't even make the toolchain build scripts create a working linker. That's why the binaries distributed in all my toolchain packages look familiar - they're from Arduino7 version of it. What's different are all the part-specific files which I upgrade to the latest ATpacks with a rebuild (and the buildscript does do this, because once it builds the binaries (including the non-functional linker) they are just making .o files and .a files to improve compile time; it doesn't have to link anything). He had already rewritten the entire wire library there (which I had already made major changes to to reduce flash usage, which started off really bad, >4k, I'd managed to shave enough from that that a 4k part could fit a really simple bit of code using wire). While adding multiple major features he was also able to slash the binary size further, such that very simple uses of wire are possible in 2k. It was found that even extending the API to give you a call to see if you were in a transaction or not is not sufficient for full reliability, and the person who encountered it was very good at writing and running testcases. Note that the Wire library used on megaTinyCore and DxCore are thesame library - their TWI peripheral is nearly identical (there are like 2-3 extra features on some of the larger parts, one of them very big, which was part of the impetus for the rewrite and which was not my work, and the other 1-2 added features on non-tiny modern AVRs are just flags, and I tacked on a function to set the simple ones once he'd gotten most of it working. But yeah, if this is what's happening, it is A) not worth fixing for the older tinies considering the amount of work, which I would estimate at weeks to months, since there are THREE VERSIONS OF WIRE that will all have the same issue, it's a tricky issue to fix, and it's broken everywhere except on the two other cores which sharea single version of wire written by someone more talented than me. End of interrupt First, let's do some Easy tests.
In 1 and 2, a key concept is whether it fails just as much as the control (that is, without change). If it still fails but fails significantly less, that points to any interrupt being able to trigger the problem, including millis, In that case, go to Step 2B + 2C Skip 2B and 2C if 1 and 2 both eliminated the issue, as you've proven the cause is related to the pulse interrupt. If they did not impact frequency of issue, or if they reduced but did not eliminate it, you should do steps 2B and 2C. 2B. Could the problem becaused by "any" interrupt during upload? One easy way to find out: You can still use delayMicroseconds and if you need a longer delay, you can also: Run this (so it's just take I2C transactions and sending back data indicating it hasn't seen any pulses, and no timekeeping, and only USI interrupts. if this results in errors, the problem has nothing to do with other interrupts, because it still occurs without any. Skip all the rest of the tests, because you've proven that the things they test for cannot possibly be the problem. Go to After the Tests 2C. If 1/2 error, but 2B does not: leave millis disabled, but resume feeding and listening for pulses. If this also does not error, but reenabling millis causes the error, skip the rest of the tests. Go to after the tests. Interlude: If 2B (running without timekeeping on) did not see errors but 1 and 2 did see errors (particularly if they saw them more rarely), then we know that the bug occurs when an interrupt overlaps with a USI slave transaction. But we still have no idea how to fix it or where in the code to look for the problem. But now it is only an issue of characterization. It will require adding debug code to the Wire library, as well as to the sketch. If running without millis timekeeping If 2C does not error, you've shown millis interrupts, not pin interrupts to be the cause of the problem. Skip the tests under 3, go to After the Tests. 3A. Characterizing the interrupt-sensitivity of USI interrupt (setup) Sketch: add the below line to onRequest handler, and the start of setup.
(this is faster than using a normal variable, and also very fast to set single bits in, and that reduces the chance that this disrupts this clearly timing dependent bug. ). Wire library: Locate the USI slave section of the library (this will be confusing, as there are a shitton of Wire implementations on this core: There's the Real TWI one used by the t88 only.There's the Slave-TWI + software I2C master used by the t841 and t828. There's the USI master + USI slave used by everything else, except the 1634, which if i'm not mistaken gets USI master and SlaveTWI instead of USI slave (it has the hardware for either but I think the slave-only TWI is better than a USI as slave. The Unhelpful Serial Interface is a pretty crappy peripheral. While more robust than pure software I2C, it's surprisingly close, despite one being hardware assisted and the other not. Once i'd located the proper section of that library, for diagnostics, I would then trace out all the code paths USI slave request could take, and which separate interrupt vectors were used. In each of those places, I would add a So to get data is simple even if you have no channel for normal realtime debugging; back in your sketch, in your pulse detect ISR add this line:
Now, at the point in time that the pulse detect ISR occurred, GPIOR0 containing the current state of the I2C transaction. Usually 0 (no transaction). When it fires between steps of the USI ISR, however, it GPIOR will be non-zero.making GPIOR1 nonzero too. Now we check for that in loop (obviously, also #include EEPROM to write eeprom with - we're using the EEPROM because the other ways of getting information out are frequently more trouble, and present their own hazards):
Disabling interrupts during this will ensure that GPIOR1 doesn't get changed whileyou do this. Ensure that the ESP is reporting the results of every request to which the tinyAVR responds. You may want to make it wake more frequently to reproduce the problem faster. 3B - Running the characterization Run the setup with the changes outlined above. Once the sketch hangs, disconnect I2C and everything from it except power (we want it to stay hung, connect your preferred ISP programmer (set to run at the same voltage) and use avrdude from the command line to read EEPROM and store in a file. Also save and examine the results from the console output generated by the ESP. A find command can likely be used to locate any errors, especially if both you and your text editor know regular expressions. A couple of the possible observations here are a smoking gun:
And repeat until either you get the case below, or you end up with the single bad response at the end of the ESP's log (which is the case above. You may or may not have by this point rules out all other places the it could be in a request, but not an interrupt.
In case 1 and 3: In case 2, ideally, repeat until you've confirmed it's hung at every point in the process when interrupts are enabled, In case 2, you don't know, but you've learned one point during the interrupt process that doesn't appear to cause the bug. So amend the part in loop to something like (if say you found 0x07 in the EEPROM):
In case 1, you find that a poorly timed interrupt does trigger the corrupt data, furthermore, through repeated tests, you've found which times are and are not vulnerable. This will be of great utility in moving towards understanding how this is happening. In case 2, if you come across every value of GPIOR1 that is possible(this number should be quite small) and they have nevercaused crashes, see case 3. In case 3, or 2 after further investigationas noted inthe previous sentence, we are forced toconclude that some other mechanism must be the thing that causes the corrupted data. In all cases, I'd really like to look over the code + results to see if there are any "obvious" (not necessarily obvious to most people, even most arduino-users).
Background on GPIORs, and why they are so handy for diagnostics That all is wonderful but there is another reason that I use GPIORs so much in debugging: Their scope is Global and then some. Say I'm debugging a private class member function that operates on private member variables, called by another private member function called by the public wrapper method. All of these functions return void. How the fuck do you get any debugging data out (let's ignore hardware debugging, I can't even get the only programing tools that can do this (microchip official ones) to be detected by MPLAB. Nobody who I've felt like I could ask a noob question like that does hardware debugging either. (yeah, this core is 100% developed with no hardware debugging - I can't make the tooling work!). You're 3 functions deep, none of them return anything, and putting a Serial print into many oddly behaving functions can change their behavior and make debugging more difficult (Serial.print is quite an inefficient function, is fairly slow, and can even in soime circumstances switch from non-blocking to blocking. Often the easiest way to smuggle data out for debugging is to simply write it to the rarely used GPIORs, and then check for it later, as we did in this example. |
Beta Was this translation helpful? Give feedback.
-
I'm using the boolean The '85 sleeps most of the time (has to in my application). It doesn't** sleep midway through a USI service. I specifically had to add this boolean test because, in early breadboarding, I2C would fall over at the first hurdle unless the chip stayed awake for the full I2C message. ** or shouldn't - I've just noticed another potential conflict in that the pulse debouncing section in the main loop also uses
I'm not sure I'm asking for a fix as such. To 'fix it' might introduce behaviour that other users might find problematic (essentially trying to manhandle the pecking order of interrupt priority to suit my own nefarious purposes - there might be genuine reasons you want USI to be aborted partway through). I guess all I'm seeking is if there's a way to temporarily mask away PCINT0 within the first USI ISR call that doesn't involve me using a one-off customised variant of the bundled Wire library in ATTinyCore***? *** PS - I've already done the necessary tweaks to Wire to make a one-off variant that does the required masking, so I do have a backup plan if the answer is 'no'. Although it does feel like I've used a Sherman tank to collect the groceries... PPS - That's all contingent on whether my new observation, above, that several things can fiddle with the |
Beta Was this translation helpful? Give feedback.
-
As far as I can tell, all of them. For a slave-address-request-plus-two-byte-response transfer I count 8 calls of either |
Beta Was this translation helpful? Give feedback.
-
Hi there,
I'm currently working on a electricity meter pulse counter using an ATtiny85 (+ATTinyCore) as an I2C slave to an ESP8266. The '85 does the counting of pulses and then sends a tally of pulses back to the ESP via I2C every few minutes for forwarding on to an MQTT broker. The whole thing is battery powered and to conserve energy the '85 sleeps in between meter pulses/I2C requests from the ESP, and the ESP only wakes up to request the latest batch of pulse counts from the '85 every 5 minutes. The whole thing seems to work flawlessly 95% of the time. It's this last 5% of reliability that I'm concerned about.
Recently I've become aware that the '85 I2C will occasionally glitch (sends a byte of 0xFF to the ESP) or hang completely (SDA stuck low). When this occurs I can rescue the '85 by briefly pulling the Reset pin to ground and I2C resumes working again, but obviously this level of manual intervention is undesirable for something that's meant for unattended operation. The period of time the system works before having a brain fart seems to vary - might be OK for several weeks or a couple of days.
What I suspect is happening is that when the '85 is counting pulses a Pin Change Interrupt may coincide with the ESP requesting data via I2C. As per the '85 datasheet PCINT0 has higher priority than USI_START, and may be upsetting the Wire library midway through a transaction after being woken from sleep. To get around this, what I think needs to happen is that when the '85 is woken up by a USI_START condition the Pin Change Interrupts need to be temporarily disabled as early as possible in the Wire library ISR itself until the I2C transaction is complete.
My thought was to somehow redefine ISR(USI_START_VECTOR) in TWI_Slave.c to include
GIMSK &= ~bit(PCIE)
to disable PCINT0, with a correspondingGIMSK = bit(PCIE)
once the transaction is done.I've checked all the obvious stuff - ATTiny85 boot fuses configured correctly, external 4k7 pullup resistors fitted on SDA and SCL, pulse counter inputs on the '85 have 10k pulldowns and are clean squarewaves when scoped, it's definitely the '85 I2C freezing not the ESP etc
My slave code for reference:
Cheers.
Beta Was this translation helpful? Give feedback.
All reactions