The SOI essentially measures the extent of the dipole as Tahiti minus Darwin (T-D) atmospheric pressure. If these two sites are perfect anti-nodes, then the correlation coefficient (CC) between them should be -1. For unfiltered monthly data, the CC is almost -0.6 (approaching -0.8 if the respective time series are smoothed). This means that either the SOI does not represent a perfect dipole, or some non-dipole noise contaminates the data (or both). Yet it remains a good enough measure, since taking the T-D difference removes much of the common-mode noise by subtraction.
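To make the CC argument concrete, here is a minimal sketch on synthetic data, not the actual SOI records. The sinusoidal dipole, the noise amplitudes, the 5-point window, and the helper names `pearson` and `moving_average` are all assumptions chosen for illustration. It shows the three effects described above: independent noise pulls the CC away from -1, smoothing pulls it back, and subtracting the two series cancels whatever noise they share.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def moving_average(x, window):
    """Boxcar smoothing; returns len(x) - window + 1 points."""
    return [sum(x[i:i + window]) / window for i in range(len(x) - window + 1)]


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 600

# Hypothetical standing-wave signal: +s at one site, -s at the other.
dipole = [math.sin(2 * math.pi * i / 48) for i in range(n)]
# Noise shared by both sites (common mode) plus independent noise per site.
common = [rng.gauss(0, 0.4) for _ in range(n)]
tahiti = [s + c + rng.gauss(0, 0.4) for s, c in zip(dipole, common)]
darwin = [-s + c + rng.gauss(0, 0.4) for s, c in zip(dipole, common)]
t_minus_d = [t - d for t, d in zip(tahiti, darwin)]  # common mode cancels here

cc_raw = pearson(tahiti, darwin)
cc_smooth = pearson(moving_average(tahiti, 5), moving_average(darwin, 5))
cc_t_vs_dipole = pearson(tahiti, dipole)
cc_td_vs_dipole = pearson(t_minus_d, dipole)

print(f"CC(T, D) raw:      {cc_raw:.2f}")
print(f"CC(T, D) smoothed: {cc_smooth:.2f}")
print(f"CC(T, dipole):     {cc_t_vs_dipole:.2f}")
print(f"CC(T-D, dipole):   {cc_td_vs_dipole:.2f}")
```

In this toy setup the smoothed CC is more negative than the raw CC, and T-D tracks the underlying dipole better than Tahiti alone, which is the sense in which the difference acts as a filter.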
For the daily SOI measure (available from [Aus BOM](https://www.longpaddock.qld.gov.au/soi/soi-data-files/) from 1991-current), the CC between Tahiti and Darwin hovers around zero if left unfiltered. This is poor for the same reason (noise) as the monthly set, except that the daily noise seems to be even worse. Only when the data are filtered over a two-week window does the CC start approaching -0.5.
Since I am interested in finding out how far I can push the daily data to isolate the higher-order wave-numbers that seem to contribute to the standing wave, I started experimenting with biased estimation techniques. One rather obvious biased approach is to compare the model point-by-point against T-D, 2T, and -2D and select the value that is closest to the model fit. This will always reduce the error of any fit to the data: unless we have a perfect dipole where T-D=2T=-2D, the selected value's distance from the model will always be smaller than or at most equal to that of T-D.
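The point-by-point selection can be sketched in a few lines. This is an illustrative reading of the procedure, not the code behind the fits shown here; the function name `select_closest` and the three sample points are made up. Note that T-D is always the midpoint of 2T and -2D, so the choice is effectively among the two ends and the middle of an interval.

```python
def select_closest(model, tahiti, darwin):
    """For each point, pick whichever of (T-D, 2T, -2D) lies nearest the model.

    T-D is listed first, so it wins any exact tie.
    """
    selected = []
    for m, t, d in zip(model, tahiti, darwin):
        candidates = (t - d, 2 * t, -2 * d)
        selected.append(min(candidates, key=lambda c: abs(c - m)))
    return selected


# Made-up values, chosen only to exercise all three candidates.
model = [1.75, -2.0, 1.0]
tahiti = [1.0, -0.5, 0.25]
darwin = [0.5, 1.5, -0.5]

picked = select_closest(model, tahiti, darwin)
print(picked)
```

Because T-D is always one of the candidates, the residual of the picked value can never exceed the T-D residual at any point, which is why the fit error cannot get worse.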
The overall rationale for trying this is similar to the rationale for using a 3-point median filter -- the value closest to the model value is less likely to be an outlier impacted by a spurious noise excursion. As the following fit shows, the technique does work, transforming the mix of uncorrelated and correlated signal into a more correlated time-series that aligns well with the model. The possible maximal excursions for 2T and -2D, evident as a fuzzy background envelope, can also be gauged -- these are often much greater than T-D in absolute value, but get chosen wherever the model also deviates too far from the T-D value.
This may not be an optimal approach, but given the poor anti-correlation of the daily SOI pair values there may not be any other options. One other approach I can think of is to maximize the Double-Sideband Carrier Modulation in the Fourier series of the raw data.
EDIT: After letting this gestate for a day or two, I'm not sure what value this approach provides, as adding a massive number of DOF will obviously improve the fit. One could just as easily introduce two additional random numbers for each point, and by cherry-picking for the best fit the CC will improve. But then again, consider this recent paper in PNAS (https://phys.org/news/2019-12-el-nio-event-year.html) that claims that an El Nino is more likely to occur after a year of "high disorder". The issue is that disorder has a broad distribution, while strict order doesn't. That doesn't seem a strong argument -- welcome back to the world of climate science at the bleeding edge!
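The cherry-picking concern is easy to check numerically. The sketch below is a toy experiment under stated assumptions: the "model" is a plain sinusoid, the "data" is Gaussian noise with no relation to it, and each point gets two extra random candidates. None of this uses SOI data.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)  # fixed seed so the run is repeatable
n = 500
model = [math.sin(2 * math.pi * i / 50) for i in range(n)]
data = [rng.gauss(0, 1) for _ in range(n)]  # pure noise, unrelated to the model

# Give every point two extra random candidates and keep the one nearest the model.
picked = [
    min((x, rng.gauss(0, 1), rng.gauss(0, 1)), key=lambda c: abs(c - m))
    for m, x in zip(model, data)
]

cc_raw = pearson(model, data)
cc_picked = pearson(model, picked)
print(f"CC before cherry-picking: {cc_raw:.2f}")
print(f"CC after cherry-picking:  {cc_picked:.2f}")
```

Even with data that carries no signal at all, selection against the model manufactures a clearly positive CC. One difference worth keeping in mind is that in the SOI case the extra candidates 2T and -2D are themselves measurements rather than random draws, so this toy result bounds the concern rather than settling it.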
This is what the CC improvement looks like on the SOI monthly time-series (CC=0.65 improved to 0.85). The error bars indicate the extent of the (T-D, 2T, -2D) triplet values not chosen. In other words, one of these is closest to the model value and thus selected as the "most valid" data point, and the other two are rejected, replaced by error bars.
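One way such error bars could be computed is sketched below, assuming the bars span from the selected value to the extremes of the rejected candidates. The function name `select_with_bars` and the sample values are hypothetical; the actual plotting code is not shown in this post.

```python
def select_with_bars(model, tahiti, darwin):
    """Return (selected, lower, upper) per point.

    lower/upper are non-negative distances from the selected value down to the
    smallest candidate and up to the largest, i.e. asymmetric error-bar lengths.
    """
    rows = []
    for m, t, d in zip(model, tahiti, darwin):
        candidates = (t - d, 2 * t, -2 * d)
        chosen = min(candidates, key=lambda c: abs(c - m))
        rows.append((chosen, chosen - min(candidates), max(candidates) - chosen))
    return rows


# Made-up values for illustration only.
model = [1.75, -2.0, 1.0]
tahiti = [1.0, -0.5, 0.25]
darwin = [0.5, 1.5, -0.5]

bars = select_with_bars(model, tahiti, darwin)
for chosen, lower, upper in bars:
    print(f"value={chosen:+.2f}  bar: -{lower:.2f} / +{upper:.2f}")
```

When T-D is selected the bar is symmetric, since 2T and -2D sit at equal distances on either side of it. When 2T or -2D is selected the bar is one-sided.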