Sevvandi Kandanaarachchi
https://sevvandi.netlify.com/
Anomaly Detection Ensembles
https://sevvandi.netlify.com/post/2022-03-16-anomaly-detection-ensembles/
Sat, 22 Oct 2022 00:00:00 +0000
<script src="https://sevvandi.netlify.com/post/2022-03-16-anomaly-detection-ensembles/index.en_files/header-attrs/header-attrs.js"></script>
<div id="what-is-an-anomaly-detection-ensemble" class="section level2">
<h2>What is an anomaly detection ensemble?</h2>
<p>It is a bunch of anomaly detection methods put together to get a final anomaly score/prediction. So you have a collection of methods, each of which has its own anomaly score, and the ensemble combines these scores into a consensus score.</p>
<p>What are the ways of constructing an anomaly detection ensemble? Broadly, anomaly detection ensembles fall into three camps.</p>
<ol style="list-style-type: decimal">
<li>Feature bagging</li>
<li>Subsampling</li>
<li>Using combination functions</li>
</ol>
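The consensus idea can be sketched in a few lines of R, using made-up scores from three hypothetical detectors (none of these numbers come from a real method):

```r
# Hypothetical scores from three detectors for the same five observations.
scores <- cbind(
  method1 = c(0.1, 0.2, 0.9, 0.3, 0.2),
  method2 = c(0.2, 0.1, 0.8, 0.2, 0.3),
  method3 = c(0.1, 0.3, 0.7, 0.4, 0.1)
)
# Consensus by averaging each observation's scores across the detectors.
ensemble_score <- rowMeans(scores)
which.max(ensemble_score)   # observation 3 is the strongest anomaly candidate
```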
</div>
<div id="feature-bagging" class="section level2">
<h2>Feature bagging</h2>
<p>Feature bagging is a very popular ensemble technique in anomaly detection. Feature bagging uses different attribute subsets to find anomalies. In a dataset, observations are generally denoted by rows and attributes by columns. Feature bagging considers different column subsets. That is, multiple copies of the same dataset, each having a slightly different set of columns, are considered. For each dataset copy, we find anomalies using a single anomaly detection method. Then the anomaly scores are averaged to compute the ensemble score.</p>
<p>Let us try this with the letter dataset from the <a href="http://odds.cs.stonybrook.edu/">ODDS repository</a>. We first read the dataset and normalize it so that each column has values between 0 and 1. Let’s have a look at the data after normalizing.</p>
<pre class="r"><code>datori <- readMat("letter.mat")
Xori <- datori$X
Xori <- unitize(Xori)
head(Xori)</code></pre>
<pre><code>## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 0.40000000 0.6666667 0.33333333 0.6 0.20000000 0.7 0.3076923 0.30769231
## [2,] 0.00000000 0.4000000 0.00000000 0.4 0.00000000 0.4 0.3846154 0.30769231
## [3,] 0.26666667 0.4666667 0.33333333 0.5 0.20000000 0.4 0.4615385 0.15384615
## [4,] 0.06666667 0.4000000 0.06666667 0.4 0.13333333 0.4 0.3846154 0.00000000
## [5,] 0.06666667 0.1333333 0.06666667 0.3 0.06666667 0.4 0.3846154 0.07692308
## [6,] 0.06666667 0.3333333 0.00000000 0.7 0.00000000 0.4 0.3846154 0.30769231
## [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
## [1,] 0.6 1.0 0.1818182 0.3636364 0.1333333 0.6428571 0.3636364 0.8888889
## [2,] 0.4 0.3 0.4545455 0.5454545 0.0000000 0.5714286 0.0000000 0.5555556
## [3,] 0.7 0.3 0.4545455 0.6363636 0.0000000 0.5714286 0.3636364 0.5555556
## [4,] 0.7 0.3 0.4545455 0.5454545 0.0000000 0.5714286 0.2727273 0.5555556
## [5,] 0.7 0.3 0.4545455 0.5454545 0.0000000 0.5714286 0.2727273 0.5555556
## [6,] 0.4 0.3 0.4545455 0.5454545 0.0000000 0.5714286 0.0000000 0.5555556
## [,17] [,18] [,19] [,20] [,21] [,22] [,23]
## [1,] 0.26666667 0.6666667 0.33333333 0.8888889 0.14285714 0.5 0.5714286
## [2,] 0.20000000 0.4666667 0.26666667 0.5555556 0.14285714 0.5 0.4285714
## [3,] 0.06666667 0.2666667 0.00000000 0.2222222 0.00000000 0.5 0.4285714
## [4,] 0.13333333 0.1333333 0.06666667 0.3333333 0.07142857 0.5 0.4285714
## [5,] 0.06666667 0.2666667 0.06666667 0.3333333 0.07142857 0.5 0.4285714
## [6,] 0.06666667 0.6666667 0.00000000 0.7777778 0.00000000 0.5 0.4285714
## [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31]
## [1,] 0.00000000 0.5714286 1.0000000 0.5 0.3076923 0 0.7500000 0.18181818
## [2,] 0.00000000 0.5000000 0.9285714 0.5 0.4615385 0 0.5833333 0.09090909
## [3,] 0.07142857 0.5000000 0.5000000 0.5 0.4615385 0 0.5833333 0.18181818
## [4,] 0.07142857 0.5714286 0.5000000 0.5 0.4615385 0 0.5833333 0.27272727
## [5,] 0.07142857 0.5714286 0.5000000 0.5 0.5384615 0 0.5833333 0.27272727
## [6,] 0.28571429 0.2857143 0.5000000 0.5 0.4615385 0 0.5833333 0.00000000
## [,32]
## [1,] 0.5
## [2,] 0.5
## [3,] 0.6
## [4,] 0.6
## [5,] 0.6
## [6,] 0.6</code></pre>
<p>Now, feature bagging selects different column subsets. Let’s pick 10 random subsets of 20 columns each.</p>
<pre class="r"><code>set.seed(1)
dd <- dim(Xori)[2]
sample_list <- list()
for(i in 1:10){
sample_list[[i]] <- sample(1:dd, 20)
}
sample_list[[1]]</code></pre>
<pre><code>## [1] 25 4 7 1 2 23 11 14 18 19 29 21 10 32 20 30 9 15 5 27</code></pre>
<pre class="r"><code>sample_list[[2]]</code></pre>
<pre><code>## [1] 9 25 14 5 29 2 10 31 12 15 1 20 3 6 26 18 19 23 4 24</code></pre>
<p>Next we select the subset of columns given by each entry of sample_list and find anomalies in each subsetted dataset. Let’s use the KNN_AGG anomaly detection method, which aggregates the k-nearest neighbour distances. If a data point has high KNN distances compared to other points, it is considered anomalous, because it is far away from other points.</p>
<pre class="r"><code>library(DDoutlier)
knn_scores <- matrix(0, nrow = NROW(Xori), ncol = 10)
for(i in 1:10){
knn_scores[ ,i] <- KNN_AGG(Xori[ ,sample_list[[i]]])
}
head(knn_scores)</code></pre>
<pre><code>## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 21.635185 22.270694 24.987923 20.026546 22.158682 22.422737 20.455542
## [2,] 9.969831 9.661039 9.432406 5.255449 5.540366 7.325621 9.302664
## [3,] 16.061851 20.796525 11.832477 14.235700 11.481942 14.899982 13.096920
## [4,] 12.409446 14.012729 11.400507 7.347249 7.739870 9.309485 9.793680
## [5,] 9.998477 12.541672 10.209202 6.857942 5.919889 9.355743 8.214986
## [6,] 12.350209 12.350209 7.409608 4.905166 5.045559 3.310723 7.461390
## [,8] [,9] [,10]
## [1,] 23.55476 21.493597 25.954645
## [2,] 10.51396 7.460010 8.571801
## [3,] 17.98096 9.872211 19.625926
## [4,] 11.78520 7.757524 11.292598
## [5,] 12.24067 6.955316 11.085984
## [6,] 12.29468 3.638703 7.127627</code></pre>
<p>Now we have the anomaly scores for the 10 subsetted datasets. In feature bagging the usual method of consensus is to add up the scores or to take their mean, which is equivalent because the mean is just the sum divided by the number of subsets.</p>
<pre class="r"><code>bagged_score <- apply(knn_scores, 1, mean)</code></pre>
<p>We can compare the bagged anomaly scores with the anomaly scores we would get without bagging. That is, if we used the full dataset, what would the anomaly scores be? Does bagging make things better? For this we need the labels/ground truth. To evaluate the performance, we use the area under the ROC curve (AUC).</p>
<pre class="r"><code>library(pROC)
labels <- datori$y[ ,1]
# anomaly scores without feature bagging - taking the full dataset
knn_agg_without <- KNN_AGG(Xori)
# ROC - without bagging
rocobj1 <- roc(labels, knn_agg_without, direction = "<")
rocobj1$auc</code></pre>
<pre><code>## Area under the curve: 0.9097</code></pre>
<pre class="r"><code>rocobj2 <- roc(labels, bagged_score, direction = "<")
rocobj2$auc</code></pre>
<pre><code>## Area under the curve: 0.9101</code></pre>
<p>Yes! We see that feature bagging increases the AUC (area under the ROC curve). In this case it is a small improvement, but an improvement nonetheless.</p>
</div>
<div id="subsampling" class="section level2">
<h2>Subsampling</h2>
<p>Subsampling uses different subsets of observations to come up with anomaly scores. Instead of columns, here we take different subsets of rows. Then we average the anomaly scores to get an ensemble score. First, let’s draw the observation samples. We subsample only from the non-anomalous observations; because anomalies are rare, we keep all of them in every subsample.</p>
<pre class="r"><code>set.seed(1)
sample_matrix <- matrix(0, nrow = NROW(Xori), ncol = 10)
inds0 <- which(labels == 0)
nn1 <- sum(inds0)
inds1 <- which(labels == 1)
nn2 <- sum(inds1)
sample_matrix[inds1, ] <- 1
for(j in 1:10){
sam <- sample(inds0, 1400)
sample_matrix[sam, j] <- 1
}
head(sample_matrix)</code></pre>
<pre><code>## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 1 1 1 0 1 1 1 1 1
## [2,] 1 1 1 1 1 1 1 1 1 1
## [3,] 1 1 1 1 1 1 1 1 1 1
## [4,] 1 1 1 1 1 1 1 1 1 1
## [5,] 1 1 1 0 1 1 1 1 1 1
## [6,] 0 1 1 1 1 1 0 1 1 1</code></pre>
<p>Our sample_matrix contains 1 if that observation is going to be used and 0 if it is not. We are using 10 subsampling iterations.</p>
<p>Now that we have our subsamples, let’s use an anomaly detection method to get the anomaly scores.</p>
<pre class="r"><code>anom_scores <- matrix(NA, nrow = NROW(Xori), ncol = 10)
for(j in 1:10){
inds <- which(sample_matrix[ ,j] == 1)
Xsub <- Xori[inds, ]
anom_scores[inds,j] <- KNN_AGG(Xsub)
}
head(anom_scores)</code></pre>
<pre><code>## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 30.13047 31.70672 31.38430 30.13047 NA 30.13047 30.13047 30.13047
## [2,] 11.46522 11.46522 11.51939 11.70227 11.63900 12.39053 11.46522 11.46522
## [3,] 22.95041 23.53152 22.95041 23.07765 22.95041 22.95041 23.67323 23.52891
## [4,] 15.17929 16.05668 15.17929 15.76817 15.17929 15.78328 16.57122 15.44892
## [5,] 13.57168 15.51147 13.60095 NA 13.57168 15.51147 14.02113 15.04368
## [6,] NA 13.46472 12.58007 13.07503 13.07503 12.70142 NA 12.58007
## [,9] [,10]
## [1,] 30.13047 30.13047
## [2,] 11.46522 11.51939
## [3,] 23.07765 22.95041
## [4,] 15.20815 15.20815
## [5,] 15.04368 15.04368
## [6,] 12.75612 12.66090</code></pre>
<p>We see there are NA values where an observation was not selected. Now we will get the mean anomaly score. But some observations did not get selected in certain iterations, and we need to take that into account.</p>
<pre class="r"><code>rowsum <- apply(sample_matrix, 1, sum)
subsampled_score <- apply(anom_scores, 1, function(x) sum(x, na.rm = TRUE))/rowsum
head(subsampled_score)</code></pre>
<pre><code>## [1] 30.44492 11.60967 23.16410 15.55824 14.54660 12.86167</code></pre>
<p>Here, we have divided the sum of the anom_scores by the number of times each observation was selected. The ensemble score is subsampled_score. Now we can see if the ensemble score is better than the original score.</p>
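Dividing the NA-removed row sums by the selection counts is the same as taking an NA-aware row mean, so the ensemble score can equivalently be computed in one line. A toy illustration with made-up scores:

```r
# Two made-up observations scored in 3 iterations; NA marks "not selected".
toy <- rbind(c(2, NA, 4),
             c(1, 3, NA))
counts <- apply(!is.na(toy), 1, sum)          # times each observation was selected
score1 <- apply(toy, 1, function(x) sum(x, na.rm = TRUE)) / counts
score2 <- apply(toy, 1, mean, na.rm = TRUE)   # the one-line equivalent
all.equal(score1, score2)   # TRUE
```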
<pre class="r"><code># ROC - without bagging
rocobj1 <- roc(labels, knn_agg_without, direction = "<")
rocobj1$auc</code></pre>
<pre><code>## Area under the curve: 0.9097</code></pre>
<pre class="r"><code>rocobj2 <- roc(labels, subsampled_score, direction = "<")
rocobj2$auc</code></pre>
<pre><code>## Area under the curve: 0.9073</code></pre>
<p>Oh dear! Not for this example. But the AUC does not go down by much, which is a relief. Sometimes the ensemble is not better than the original model, but most of the time it is. That is why we use ensembles.</p>
</div>
<div id="using-combination-functions" class="section level2">
<h2>Using combination functions</h2>
<p>For both the above examples we used the average as the combination function. That is, given a set of scores for each observation, we averaged them. But we can do different things. For example, we can take the maximum instead of the average. Or we can take the geometric mean, which multiplies all of the scores and takes the Nth root of the product. As examples, let’s try those two functions, the maximum and the geometric mean, on the knn_scores from the feature bagging example.</p>
<pre class="r"><code>max_score <- apply(knn_scores, 1, max)
head(max_score)</code></pre>
<pre><code>## [1] 25.95465 10.51396 20.79653 14.01273 12.54167 12.35021</code></pre>
<pre class="r"><code>prod_score <-(apply(knn_scores, 1, prod))^(1/10)
head(prod_score)</code></pre>
<pre><code>## [1] 22.427853 8.097505 14.597109 10.059545 9.070620 6.828194</code></pre>
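One practical note on the geometric mean: the product of many scores can overflow or underflow, so it is numerically safer to compute it in log space; the two forms agree. A sketch with made-up scores:

```r
x <- c(22, 10, 15, 9)                  # made-up scores for one observation
gm_direct <- prod(x)^(1 / length(x))   # direct product, can overflow for many scores
gm_log    <- exp(mean(log(x)))         # log-space version, numerically safer
all.equal(gm_direct, gm_log)   # TRUE
```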
<p>Let’s see if this is better than taking the average (mean).</p>
<pre class="r"><code>rocobj1 <- roc(labels, bagged_score, direction = "<")
rocobj1$auc</code></pre>
<pre><code>## Area under the curve: 0.9101</code></pre>
<pre class="r"><code># ROC - Max
rocobj1 <- roc(labels, max_score, direction = "<")
rocobj1$auc</code></pre>
<pre><code>## Area under the curve: 0.9129</code></pre>
<pre class="r"><code>rocobj2 <- roc(labels, prod_score, direction = "<")
rocobj2$auc</code></pre>
<pre><code>## Area under the curve: 0.908</code></pre>
<p>Well! Taking the maximum improved it a bit, but taking the geometric mean reduced it a bit. Interesting!</p>
<p>But, combination functions are not generally used this way! In this example we used feature bagging anomaly scores with different combination functions, and we used a single anomaly detection method throughout. Generally, combination functions are used when multiple anomaly detection methods are involved. Then it makes sense to weight the methods differently to get an ensemble score.</p>
</div>
<div id="combination-functions-with-multiple-anomaly-detection-methods" class="section level2">
<h2>Combination functions with multiple anomaly detection methods</h2>
<p>So far, we’ve only looked at one anomaly detection method. Let’s take multiple anomaly detection methods from the DDoutlier package.</p>
<pre class="r"><code>knn <- KNN_AGG(Xori)
lof <- LOF(Xori)
cof <- COF(Xori)
inflo <- INFLO(Xori)
kdeos <- KDEOS(Xori)
ldf_score <- LDF(Xori)
ldf <- ldf_score$LDE   # LDF() returns a list; we take the LDE component
ldof <- LDOF(Xori)</code></pre>
<p>Now we have tried 7 anomaly detection methods. Let’s see different combination functions.</p>
<ol style="list-style-type: decimal">
<li>Mean</li>
<li>Maximum</li>
<li>Geometric Mean</li>
<li>An IRT-based method</li>
</ol>
<p>The IRT-based method is discussed in detail in this <a href="https://www.sciencedirect.com/science/article/abs/pii/S0020025521012639">paper</a>. Basically, it is a fancier combination method that uses item response theory under the hood.</p>
<p>Let’s try these methods and get ensemble scores.</p>
<pre class="r"><code>library(outlierensembles)
anomaly_scores <- cbind.data.frame(knn, lof, cof, inflo, kdeos, ldf, ldof)
mean_ensemble <- apply(anomaly_scores, 1, mean)
max_ensemble <- apply(anomaly_scores, 1, max)
geom_mean_ensemble <- (apply(anomaly_scores, 1, prod))^(1/7)
irt_mod <- irt_ensemble(anomaly_scores)</code></pre>
<pre><code>## Warning in sqrt(diag(solve(Hess))): NaNs produced</code></pre>
<pre class="r"><code>irt_ensemble <- irt_mod$scores</code></pre>
<p>Let’s evaluate the different ensembles.</p>
<pre class="r"><code>rocobj1 <- roc(labels, mean_ensemble, direction = "<")
rocobj1$auc</code></pre>
<pre><code>## Area under the curve: 0.76</code></pre>
<pre class="r"><code># ROC - Max
rocobj1 <- roc(labels, max_ensemble, direction = "<")
rocobj1$auc</code></pre>
<pre><code>## Area under the curve: 0.7613</code></pre>
<pre class="r"><code>rocobj1 <- roc(labels, geom_mean_ensemble, direction = "<")
rocobj1$auc</code></pre>
<pre><code>## Area under the curve: 0.406</code></pre>
<pre class="r"><code>rocobj1 <- roc(labels, irt_ensemble, direction = "<")
rocobj1$auc</code></pre>
<pre><code>## Area under the curve: 0.904</code></pre>
<p>For this example, the IRT ensemble performs best. The mean and max ensembles perform similarly. The geometric mean ensemble performs very poorly; I included it only as another example of a combination function, and it is not actually used in anomaly detection ensembles.</p>
<p>As you can see, there are different methods of ensembling. You can use feature bagging, subsampling, or a different combination function. And you can do more than one thing: feature bagging with a different combination function, or bagging and subsampling together. A lot of options!</p>
</div>
Anomaly detection in dynamic networks
https://sevvandi.netlify.com/preprint/2022-01-01_anomaly_detection_in/
Sat, 15 Oct 2022 00:00:00 +0000
Short-term prediction of stream turbidity using surrogate data and a meta-model approach
https://sevvandi.netlify.com/preprint/2022-01-01_short-term_predictio/
Tue, 11 Oct 2022 00:00:00 +0000
Honeyboost: Boosting honeypot performance with data fusion and anomaly detection
https://sevvandi.netlify.com/publication/2022-05-06_honeyboost_boosting_/
Fri, 06 May 2022 00:00:00 +0000
Evaluating Algorithm Portfolios using Item Response Theory
https://sevvandi.netlify.com/talk/2022_optima/
Wed, 04 May 2022 13:00:00 +0000
From ensembles to computer networks
https://sevvandi.netlify.com/talk/2022_data61/
Thu, 21 Apr 2022 13:00:00 +0000
Unsupervised anomaly detection ensembles using Item Response Theory
https://sevvandi.netlify.com/publication/2021-02-05_unsupervised_anomaly/
Tue, 01 Mar 2022 00:00:00 +0000
Getting better at detecting anomalies by using ensembles
https://sevvandi.netlify.com/talk/2022_anziam/
Mon, 07 Feb 2022 13:00:00 +0000
Benchmarking Algorithm Portfolio Construction Methods
https://sevvandi.netlify.com/publication/2022-01-01_benchmarking_algorit/
Sat, 01 Jan 2022 00:00:00 +0000
Leave-one-out kernel density estimates for outlier detection
https://sevvandi.netlify.com/publication/2021-10-21_leave-one-out_kernel/
Wed, 22 Dec 2021 00:00:00 +0000
Di Cook's podcast on data visualization and reproducibility
https://sevvandi.netlify.com/podcasts/2021-10-25-dicook-podcast/
Mon, 25 Oct 2021 00:00:00 +0000
<script src="https://sevvandi.netlify.com/podcasts/2021-10-25-dicook-podcast/index_files/header-attrs/header-attrs.js"></script>
<p><img src="https://sevvandi.netlify.com/img/di-sk-podcast.jpg" />
In this episode I spoke to <a href="https://www.dicook.org/">Prof Di Cook</a> about data visualization and reproducibility in research. Di spoke about the challenges in visualizing data in high dimensions and how she became interested in data visualization. She also spoke about the importance of reproducible research. Di has a great phrase for reproducibility, “if I can do it, you can do it too!” Di says that sums it up.</p>
<p>As usual Tim Macuga of ACEMS helped me with this podcast. This was all done on Zoom. This podcast is available at <a href="https://acems.org.au/podcast/episode-64-di-cook" class="uri">https://acems.org.au/podcast/episode-64-di-cook</a></p>
Evaluating Algorithms using Item Response Theory.
https://sevvandi.netlify.com/talk/2021_emcr/
Thu, 14 Oct 2021 13:00:00 +0000
Anomalies! You can't escape them.
https://sevvandi.netlify.com/talk/2021_rladies/
Thu, 16 Sep 2021 18:00:00 +0000
Kate Smith-Miles' podcast on OPTIMA and industry
https://sevvandi.netlify.com/podcasts/2021-09-08-kate-optima-podcast/
Wed, 08 Sep 2021 00:00:00 +0000
<script src="https://sevvandi.netlify.com/podcasts/2021-09-08-kate-optima-podcast/index_files/header-attrs/header-attrs.js"></script>
<p><img src="https://sevvandi.netlify.com/img/kate-sk.jpg" />
In this episode I spoke to <a href="https://katesmithmiles.wixsite.com/home">Prof Kate Smith-Miles</a> about <a href="https://optima.org.au/">OPTIMA</a>. OPTIMA is an ARC Training Centre and it stands for Optimisation Technologies, Integrated Methodologies and Applications. Kate spoke a lot about the importance of industry collaborations and how to build successful relationships with the industry.</p>
<p>As usual Tim Macuga of ACEMS helped me with this podcast. This was all done on Zoom. This podcast is available at <a href="https://acems.org.au/podcast/episode-60-optima" class="uri">https://acems.org.au/podcast/episode-60-optima</a></p>
Anomalies and events keep us on our toes!
https://sevvandi.netlify.com/talk/2021_future_of_data_science/
Sun, 22 Aug 2021 12:00:00 +0000
Here is the anomalow-down!
https://sevvandi.netlify.com/talk/2021_qut_seminar/
Thu, 08 Jul 2021 13:00:00 +0000
Lookout! Persisting anomalies ahead
https://sevvandi.netlify.com/talk/2021_anzsc/
Mon, 05 Jul 2021 13:00:00 +0000
A chat with Cheryl Praeger
https://sevvandi.netlify.com/podcasts/2021-06-08-cheryl-praegers-podcast/
Tue, 15 Jun 2021 00:00:00 +0000
<script src="https://sevvandi.netlify.com/podcasts/2021-06-08-cheryl-praegers-podcast/index_files/header-attrs/header-attrs.js"></script>
<p><img src="https://sevvandi.netlify.com/img/Cheryl.jfif" />
In this episode I spoke to <a href="https://www.uwa.edu.au/profile/cheryl-praeger">Prof Cheryl Praeger</a> about many things. She spoke about her love for mathematics and told us a bit about her mathematical journey. Cheryl spoke about how the mathematical landscape in Australia has changed over the years. She encouraged people to reach out to the wider community and promote mathematics at all levels.</p>
<p>As usual Tim Macuga of ACEMS helped me with this podcast. This was all done on Zoom as seen from the photo. This podcast is available at <a href="https://acems.org.au/podcast/episode-53-cheryl-praeger" class="uri">https://acems.org.au/podcast/episode-53-cheryl-praeger</a></p>
outlierensembles
https://sevvandi.netlify.com/software/outlierensembles/
Sat, 05 Jun 2021 09:42:00 +0000
Looking out for anomalies
https://sevvandi.netlify.com/talk/2021_user/
Wed, 26 May 2021 13:00:00 +0000
Early detection of vegetation ignition due to powerline faults
https://sevvandi.netlify.com/publication/2020-07-04_early_detection_of_v/
Sun, 23 May 2021 00:00:00 +0000
Anomalies and Algorithms
https://sevvandi.netlify.com/project/algorithms/
Tue, 04 May 2021 00:00:00 +0000
<p>Detecting <strong>unusual patterns in data</strong> is really important because they tell a different story from the norm. These unusual data points or anomalies might signify intrusions, fraudulent credit card activities or an abnormal reaction to a vaccine.</p>
<p>I have investigated the following research questions in anomaly detection.</p>
<ol>
<li><p>High dimensional data <br />
When data is high dimensional, anomaly detection becomes more complex.</p></li>
<li><p>Choosing parameters for anomaly detection algorithms <br />
Most anomaly detection algorithms include user-defined parameters. Selecting these parameters carefully is really important, because for different parameters the algorithms detect different anomalies. So the obvious question, “which anomalies are real anomalies?”, arises.</p></li>
<li><p>Anomaly persistence <br />
Are there anomalies that get identified for a large range of parameter values? How can we visualize them?</p></li>
<li><p>Pre-processing techniques for anomaly detection <br />
There are standard pre-processing techniques applied to a dataset before performing anomaly detection. What are the effects of these pre-processing techniques? If you use a different pre-processing technique, will the algorithms detect different anomalies?</p></li>
<li><p>Which anomaly detection method is best suited for my problem? <br />
This is known as the “algorithm selection problem”. No single algorithm gives superior performance on all problems, a result known as the <em>No Free Lunch</em> theorem. So how would you select the best algorithm for your problem?</p></li>
</ol>
<p>Below are some non-technical summaries of my research.</p>
<p><font size="5"> <strong>Leave-one-out kernel density estimates for outlier detection</strong></font> <br />
Sevvandi Kandanaarachchi, Rob J Hyndman <br />
preprint (2021) <br />
<img src="persistence.png" alt="half-size image" />
There are many density-based methods for anomaly detection. Kernel density estimates are used to compute the density of points in a data cloud. The kernel density estimation algorithms have a parameter called the bandwidth. Anomaly detection algorithms that use kernel density estimates ask the user to input the bandwidth parameter. We use a different branch of mathematics – topological data analysis – to compute the bandwidth for anomaly detection. We call this algorithm <em>lookout</em>. We’ve made the R package <em>lookout</em> available and details can be found at <a href="https://sevvandi.github.io/lookout/index.html" target="_blank">https://sevvandi.github.io/lookout/index.html</a>.</p>
<p>We also look at anomaly persistence. That is, when we change the bandwidth gradually, how will the anomalies change? We explore this using a persistence diagram, which is shown at the top. The anomalies identified by lookout are shown in dark red on the left. On the right, we see that the same anomalies persist for a large range of bandwidth values.</p>
<p><font size="5"> <strong>Dimension reduction for outlier detection using DOBIN</strong> </font> <br />
Sevvandi Kandanaarachchi, Rob J Hyndman <br />
JCGS (2020) <br />
<img src="lesmis.png" alt="half-size image" />
Detecting anomalies in high dimensions is a challenge. Often people use low dimensional representations of the data to detect anomalies. But there is an issue: how would you know that an anomaly in the high dimensional space is still an anomaly in the low dimensional representation? This is what we address in this paper. We find a low dimensional representation of the data so that anomalies in the original space are still anomalies in the low dimensional space. We call this algorithm <em>dobin</em>. (The R package dobin is available on CRAN.)</p>
<p>The figure at the top shows the characters in <em>Les Miserables</em> transformed to a low dimensional space using dobin. We see that Valjean is quite anomalous compared to the other characters.</p>
<p><font size="5"> <strong>On normalization and algorithm selection for unsupervised outlier detection</strong> </font> <br />
Sevvandi Kandanaarachchi, Mario A Munoz, Rob J Hyndman, Kate Smith-Miles <br />
Data Mining and Knowledge Discovery (2020)<br />
<img src="distribution_svm_portfolio.png" alt="half-size image" />
A standard pre-processing technique for anomaly detection is to normalize the dataset. We explore different normalization techniques and their effects on different anomaly detection methods. The most commonly used normalization method is called Min-Max. We show that Min-Max is only suited for about 50% of the datasets in our repository. So, effectively, we are showing that the normalization method should not be fixed, but should be chosen depending on the dataset.</p>
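For concreteness, Min-Max rescales each column x to (x - min(x)) / (max(x) - min(x)), so that all values lie between 0 and 1. A one-column sketch with made-up numbers:

```r
x <- c(3, 7, 5, 11)                          # a made-up column of a dataset
minmax <- (x - min(x)) / (max(x) - min(x))   # Min-Max normalization
range(minmax)   # 0 1
```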
<p>We also explore the algorithm selection problem for anomaly detection. That is, we answer the question, which anomaly detection algorithm is best suited for my problem? We also construct an instance space of these test problems, which is shown in the figure above. In this figure each point denotes a dataset and the color represents the best anomaly detection algorithm suited for it.</p>
Cyber Security
https://sevvandi.netlify.com/project/security/
Mon, 03 May 2021 00:00:00 +0000
<p>Cyber attacks are increasingly common in today’s world. Generally speaking, attacks or intrusions can be identified by comparing with known attack signatures or by using anomaly detection. Signature-based methods are really effective in identifying known attacks, but they cannot identify new attacks. Hence the importance of anomaly detection.</p>
<p>A honeypot aims to lure attackers in a computer network. You can read more about honeypots at <a href="https://us.norton.com/internetsecurity-iot-what-is-a-honeypot.html" target="_blank">https://us.norton.com/internetsecurity-iot-what-is-a-honeypot.html</a>. But honeypots are almost never standalone security measures. We look at ways of increasing honeypot performance.</p>
<p><font size="5"> <strong>Honeyboost: Boosting honeypot performance with data fusion and anomaly detection</strong></font> <br />
Sevvandi Kandanaarachchi, Hideya Ochiai, Asha Rao <br />
preprint (2021) <br />
<img src="lan_pic.png" alt="half-size image" /></p>
<p>How do you identify anomalous network traffic? Can you identify anomalous devices in a computer network? We explore these questions in this paper which is available at <a href="https://arxiv.org/abs/2105.02526" target="_blank">https://arxiv.org/abs/2105.02526</a>.</p>
Dimension reduction for outlier detection using DOBIN
https://sevvandi.netlify.com/publication/2019-01-01_dimension_reduction_/
Sat, 01 May 2021 00:00:00 +0000
lookout
https://sevvandi.netlify.com/software/lookout/
Sat, 13 Feb 2021 21:00:00 +0000
Join us
https://sevvandi.netlify.com/joinus/
Sun, 07 Feb 2021 00:00:00 +0000
<p>If you’re interested in working with me on any of these research areas, get in touch.</p>
<h2 id="detecting-interesting-events-from-sensor-data">Detecting interesting events from sensor data</h2>
<p><img src="NASA.jpg" alt="half-size image" /></p>
<p>Sensors, sensors, sensors! There is a massive amount of sensor data available to the general public from various organisations including NASA, ESA and local governments. Some data from NASA or ESA satellites cover the whole world. The picture above shows aerosol data from a NASA satellite for a specific month.</p>
<p><img src="melbourne_peds.png" alt="half-size image" /></p>
<p>Some data are more localised. The picture above shows the count sensor data from the pedestrian counting system in Melbourne at a certain time (<a href="http://www.pedestrian.melbourne.vic.gov.au/" target="_blank">http://www.pedestrian.melbourne.vic.gov.au/</a>). Whichever type of sensor data you’re looking at, you can detect interesting events from this data, giving analysts, decision makers and the general public useful insights. Developing mathematical, statistical and machine learning models to detect these events in a robust manner is of interest to many stakeholders.</p>
<h2 id="anomalies-in-networks">Anomalies in Networks</h2>
<p><img src="Network.jpg" alt="half-size image" />
Many applications such as Cyber Security and Social Networks have dynamic network structures. Detecting anomalous events in these complex structures is crucial due to the high impact nature of these applications. For example, an anomalous event in an IoT network may signify an intrusion. New methodology is needed to detect such events in these dynamic environments.</p>
<h2 id="methodology-for-anomaly-detection">Methodology for anomaly detection</h2>
<p>Often, we need to take a step back from the application and formulate the problem using mathematical and statistical concepts. Developing methodologies for anomaly detection in different mathematical settings gives more flexibility than concentrating on a specific application. Many times, the application serves as a starting point to explore a bigger research problem.</p>
<h2 id="information-for-prospective-students">Information for prospective students</h2>
<p>If you’re interested in working with me on a PhD or an MSc, you need to have a degree in Mathematics, Statistics, IT or Engineering. In addition, some programming skills are also needed.</p>
<p>An interest in learning more mathematics and statistics is essential, because a PhD or an MSc is a journey into the unknown, exploring new territories.</p>
<p>I’m always happy to chat with prospective MSc and PhD students.</p>
Sensor data
https://sevvandi.netlify.com/project/sensors/
Sun, 07 Feb 2021 00:00:00 +0000https://sevvandi.netlify.com/project/sensors/
<p>Sensor data can be of a spatio-temporal nature. Below are some non-technical summaries of my research.</p>
<p><font size="5"> <strong>Early classification of spatio-temporal events using partial information</strong></font> <br />
Sevvandi Kandanaarachchi, Rob J Hyndman, Kate Smith-Miles <br />
PLoS ONE (2020) <br />
<img src="NO2_Event_Clusters.png" alt="half-size image" />
How do we detect <strong>events of interest</strong> in spatio-temporal data? For example, high-density clusters of Nitrogen Dioxide may be of interest, or high-density clusters of aerosols may herald an impending bushfire. We present an event extraction method capable of extracting such events. We also discuss an algorithm for event classification using partial data. These algorithms are available in the R package <em>eventstream</em> and details can be found at <a href="https://sevvandi.github.io/eventstream/index.html" target="_blank">https://sevvandi.github.io/eventstream/index.html</a></p>
<p><font size="5"> <strong>Early detection of vegetation ignition due to powerline faults</strong></font> <br />
Sevvandi Kandanaarachchi, Nandini Anantharama, Mario A. Munoz <br />
IEEE Transactions on Power Delivery (2020) <br />
<img src="bushfires.jpg" alt="Courtesy of The Bendigo Advertiser" />
On a scorching hot summer’s day, a branch falls on a powerline. This can spark a bushfire that ravages the countryside. We predict ignition from branches coming into contact with powerlines before the branch catches fire; early detection can help prevent bushfires. This paper resulted from the initial work we did with the Victorian Government as part of the Vegetation Detection Challenge (VDC). We were awarded the second prize at the VDC and the work is featured at <a href="https://www.energy.vic.gov.au/safety-and-emergencies/powerline-bushfire-safety-program/research-and-development-grants/vegetation-detection-challenge" target="_blank">https://www.energy.vic.gov.au/safety-and-emergencies/powerline-bushfire-safety-program/research-and-development-grants/vegetation-detection-challenge</a></p>
<p><font size="5"> <strong>Predicting sediment and nutrient concentrations from high-frequency water-quality data</strong></font> <br />
Catherine Leigh, Sevvandi Kandanaarachchi, James M. McGree, Rob J. Hyndman, Omar Alsibai, Kerrie Mengersen, Erin E. Peterson <br />
PLoS ONE (2019) <br />
<img src="water_sensor.jpg" alt="Flickr image" /></p>
<p>Monitoring water in our rivers is important for many stakeholders, including the Government. We use water-quality sensor data to predict turbidity, a measure of water clarity, and the nitrites present in the water. This project was a collaboration with the Queensland Government. At the time, the standard practice was to sample water manually: a person had to go to the site, collect a sample of water and take it to a lab for testing. Our work showed that this can be automated using in-situ sensors that measure related quantities, enabling more frequent predictions. This work was featured at <a href="https://www.arc.gov.au/news-publications/media/research-highlights/revolutionising-water-quality-monitoring-our-rivers-and-reef" target="_blank">https://www.arc.gov.au/news-publications/media/research-highlights/revolutionising-water-quality-monitoring-our-rivers-and-reef</a></p>
<p><font size="5"> <strong>A framework for automated anomaly detection in high frequency water-quality data from in situ sensors</strong></font> <br />
Catherine Leigh, Omar Alsibai, Rob J Hyndman, Sevvandi Kandanaarachchi, Olivia C King, James M McGree, Catherine Neelamraju, Jennifer Strauss, Priyanga Dilini Talagala, Ryan DR Turner and others <br />
Science of The Total Environment (2019) <br />
<img src="water_anomaly.jpg" alt="R image" />
Finding anomalies in water quality data using sensors was part of the same project we did with the Queensland Government. The anomalies from the sensors might be caused by many things including a dying battery or a sudden change upstream. Therefore, it is important to detect these anomalies quickly.</p>
Testing an Outlier Detection Method
https://sevvandi.netlify.com/post/2021-02-06-how-to-test-an-outlier-detection-method/
Sat, 06 Feb 2021 00:00:00 +0000https://sevvandi.netlify.com/post/2021-02-06-how-to-test-an-outlier-detection-method/
<script src="https://sevvandi.netlify.com/post/2021-02-06-how-to-test-an-outlier-detection-method/index.en_files/header-attrs/header-attrs.js"></script>
<p>Suppose you have developed an outlier detection method. What are the ways to test it? You can generate some random data, add a couple of outliers and see if your method gives the outliers high outlier scores. Let’s try this out with a couple of well-known outlier detection methods from the R package DDoutlier.</p>
<pre class="r"><code>knitr::opts_chunk$set(echo = TRUE, cache=TRUE, message=FALSE)
library(DDoutlier)
library(pROC)</code></pre>
<pre><code>## Type 'citation("pROC")' for a citation.</code></pre>
<pre><code>##
## Attaching package: 'pROC'</code></pre>
<pre><code>## The following objects are masked from 'package:stats':
##
## cov, smooth, var</code></pre>
<pre class="r"><code>library(ggplot2)
library(tidyr)</code></pre>
<p>We will use three methods from the DDoutlier package: KNN, LOF and COF. We will compute the outlier scores and use the area under the Receiver Operating Characteristic (ROC) curve to assess the accuracy of these methods.</p>
<pre class="r"><code>set.seed(1)
X1 <- data.frame(x1=rnorm(500), x2=rnorm(500))
oo <- data.frame(x1=rnorm(5, mean=5), x2=rnorm(5, mean=5))
X <- rbind.data.frame(X1, oo)
labs <- c(rep(0, 500), rep(1, 5))
X <- cbind.data.frame(X, labs)
ggplot(X, aes(x=x1, y=x2, color=as.factor(labs))) + geom_point() + theme_bw()</code></pre>
<p><img src="https://sevvandi.netlify.com/post/2021-02-06-how-to-test-an-outlier-detection-method/index.en_files/figure-html/ex1p1-1.png" width="672" /></p>
<pre class="r"><code># Outlier Detection Methods
knn_scores <- DDoutlier::KNN_AGG(X)
lof_scores <- DDoutlier::LOF(X)
cof_scores <- DDoutlier::COF(X)
# ROC Curve
knn_roc <- pROC::roc(labs, knn_scores)
knn_roc</code></pre>
<pre><code>##
## Call:
## roc.default(response = labs, predictor = knn_scores)
##
## Data: knn_scores in 500 controls (labs 0) < 5 cases (labs 1).
## Area under the curve: 1</code></pre>
<pre class="r"><code>lof_roc <- pROC::roc(labs, lof_scores)
lof_roc</code></pre>
<pre><code>##
## Call:
## roc.default(response = labs, predictor = lof_scores)
##
## Data: lof_scores in 500 controls (labs 0) < 5 cases (labs 1).
## Area under the curve: 0.9884</code></pre>
<pre class="r"><code>cof_roc <- pROC::roc(labs, cof_scores)
cof_roc</code></pre>
<pre><code>##
## Call:
## roc.default(response = labs, predictor = cof_scores)
##
## Data: cof_scores in 500 controls (labs 0) < 5 cases (labs 1).
## Area under the curve: 0.8212</code></pre>
<p>This example had rather obvious outliers. Next, we look at the case where the outliers slowly move out from the main distribution. To do this, we run several iterations of the same experiment. For the first iteration, the outliers are at the boundary of the normal distribution. With each iteration, the outliers move out, bit by bit. We control this with a parameter <span class="math inline">\(\mu\)</span>, which starts at 2 and increases by 0.5 in each iteration. Which method gives better performance then?</p>
<pre class="r"><code>set.seed(1)
knn_auc <- lof_auc <- cof_auc <- rep(0, 10)
for(i in 1:10){
  X1 <- data.frame(x1=rnorm(500), x2=rnorm(500))
  mu <- 2 + (i-1)/2
  oo <- data.frame(x1=rnorm(5, mean=mu, sd=0.2), x2=rnorm(5, mean=mu, sd=0.2))
  X <- rbind.data.frame(X1, oo)
  labs <- c(rep(0, 500), rep(1, 5))
  X <- cbind.data.frame(X, labs)
  # Outlier Detection Methods
  knn_scores <- DDoutlier::KNN_AGG(X)
  lof_scores <- DDoutlier::LOF(X)
  cof_scores <- DDoutlier::COF(X)
  # Area Under ROC = AUC values
  # KNN
  roc_obj <- pROC::roc(labs, knn_scores, direction ="<")
  knn_auc[i] <- roc_obj$auc
  # LOF
  roc_obj <- pROC::roc(labs, lof_scores, direction ="<")
  lof_auc[i] <- roc_obj$auc
  # COF
  roc_obj <- pROC::roc(labs, cof_scores, direction ="<")
  cof_auc[i] <- roc_obj$auc
}
df <- data.frame(Iteration=1:10, KNN=knn_auc, LOF=lof_auc, COF=cof_auc)
dfl <- tidyr::pivot_longer(df, 2:4)
colnames(dfl)[2:3] <- c("Method", "AUC")
ggplot(dfl, aes(x=Iteration, y=AUC, color=Method)) + geom_point() + geom_line() + scale_x_continuous(breaks=1:10) + theme_bw()</code></pre>
<p><img src="https://sevvandi.netlify.com/post/2021-02-06-how-to-test-an-outlier-detection-method/index.en_files/figure-html/ex1Iter-1.png" width="672" />
KNN is performing better than LOF. COF is not performing well at all. Maybe the parameters are not suitable for COF. Is KNN significantly better than LOF? To answer that question, we can repeat this example <span class="math inline">\(n\)</span> times and analyse the results.</p>
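<p>The repetition idea can be sketched as follows. This is a minimal sketch assuming the DDoutlier and pROC packages are installed; the 20 repetitions, the outlier mean of 3 and the paired Wilcoxon test are illustrative choices, not the only sensible ones.</p>

```r
# Repeat the experiment several times, collect the AUCs, and run a paired
# Wilcoxon test on the AUC differences between KNN and LOF.
library(DDoutlier)
library(pROC)

set.seed(1)
n_reps <- 20
knn_auc <- lof_auc <- rep(0, n_reps)
for (j in 1:n_reps) {
  X1 <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
  oo <- data.frame(x1 = rnorm(5, mean = 3, sd = 0.2),
                   x2 = rnorm(5, mean = 3, sd = 0.2))
  X <- rbind.data.frame(X1, oo)
  labs <- c(rep(0, 500), rep(1, 5))
  knn_auc[j] <- pROC::roc(labs, DDoutlier::KNN_AGG(X), direction = "<")$auc
  lof_auc[j] <- pROC::roc(labs, DDoutlier::LOF(X), direction = "<")$auc
}
# A paired test on the AUCs; a small p-value suggests a systematic difference
wilcox.test(knn_auc, lof_auc, paired = TRUE)
```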
<p>Let us consider another example. In this one, the points lie in an annulus and the outliers move further into the hole with each iteration.</p>
<pre class="r"><code>set.seed(1)
r1 <- runif(805)
r2 <- rnorm(805, mean=5)
theta <- 2*pi*r1
R2 <- 2
radius <- r2 + R2
x <- radius * cos(theta)
y <- radius * sin(theta)
X <- data.frame(x1 = x, x2 = y)
labs <- c(rep(0, 800), rep(1, 5))
knn_auc <- lof_auc <- cof_auc <- rep(0, 10)
for(i in 1:10){
  mu <- 5 - (i-1)*0.5
  z <- cbind(rnorm(5, mu, sd=0.2), rnorm(5, 0, sd=0.2))
  X[801:805, 1:2] <- z
  # Outlier Detection Methods
  knn_scores <- DDoutlier::KNN_AGG(X)
  lof_scores <- DDoutlier::LOF(X)
  cof_scores <- DDoutlier::COF(X)
  # Area Under ROC = AUC values
  # KNN
  roc_obj <- pROC::roc(labs, knn_scores, direction ="<")
  knn_auc[i] <- roc_obj$auc
  # LOF
  roc_obj <- pROC::roc(labs, lof_scores, direction ="<")
  lof_auc[i] <- roc_obj$auc
  # COF
  roc_obj <- pROC::roc(labs, cof_scores, direction ="<")
  cof_auc[i] <- roc_obj$auc
}
X <- cbind.data.frame(X, labs)
# Plot of points in the last iteration
ggplot(X, aes(x1, x2, col=as.factor(labs))) + geom_point()</code></pre>
<p><img src="https://sevvandi.netlify.com/post/2021-02-06-how-to-test-an-outlier-detection-method/index.en_files/figure-html/EX2-1.png" width="672" /></p>
<pre class="r"><code>df <- data.frame(Iteration=1:10, KNN=knn_auc, LOF=lof_auc, COF=cof_auc)
dfl <- tidyr::pivot_longer(df, 2:4)
colnames(dfl)[2:3] <- c("Method", "AUC")
ggplot(dfl, aes(x=Iteration, y=AUC, color=Method)) + geom_point() + geom_line() + scale_x_continuous(breaks=1:10) + theme_bw()</code></pre>
<p><img src="https://sevvandi.netlify.com/post/2021-02-06-how-to-test-an-outlier-detection-method/index.en_files/figure-html/EX2-2.png" width="672" /></p>
<p>We see that KNN > LOF > COF for this example. Again, by repeating the example many times, we can reduce the effect of randomness.</p>
Anomaly detection data repositories
https://sevvandi.netlify.com/post/2021-01-23-anomaly-detection-datasets/
Sat, 23 Jan 2021 00:00:00 +0000https://sevvandi.netlify.com/post/2021-01-23-anomaly-detection-datasets/
<script src="https://sevvandi.netlify.com/post/2021-01-23-anomaly-detection-datasets/index.en_files/header-attrs/header-attrs.js"></script>
<p>In this post we will look at data repositories available for anomaly detection. So, can you use a standard classification dataset for anomaly detection? You can, if you <em>downsample</em> one class, preferably the minority class, and label the downsampled observations as anomalies. If you’re comparing the performance of multiple anomaly detection methods, then one downsampled dataset is not enough; it is better to evaluate methods on multiple such versions to account for the random sampling. If you’re using anomaly detection methods that need numerical data, the categorical variables in your classification dataset have to be handled. Are you going to drop them or convert them to numerical data? In addition, you need to take missing values into account. So, let us look at some repositories.</p>
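<p>As a concrete sketch of the downsampling idea, here is one way to build an anomaly detection version of a classification dataset in base R using the iris data; the chosen class and the 5 retained observations are arbitrary illustrative choices.</p>

```r
# Downsample one class of a classification dataset and relabel it as anomalies
set.seed(1)
normal    <- iris[iris$Species != "virginica", 1:4]   # 100 "normal" points
anom_pool <- iris[iris$Species == "virginica", 1:4]
anomalies <- anom_pool[sample(nrow(anom_pool), 5), ]  # keep only 5

X    <- rbind(normal, anomalies)
labs <- c(rep(0, nrow(normal)), rep(1, nrow(anomalies)))

# Repeating the sampling with different seeds gives multiple versions of the
# dataset, so method comparisons are not tied to a single random downsample
table(labs)
```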
<ol style="list-style-type: decimal">
<li><p><a href="http://odds.cs.stonybrook.edu/">Outlier Detection Datasets - ODDS</a>
This is a really good website that provides multi-dimensional point datasets, time series graph datasets for event detection, time series point datasets, adversarial attack and security datasets and crowd scene video datasets. They describe the datasets well and provide references. It is maintained by Shebuti Rayana.</p></li>
<li><p><a href="https://figshare.com/articles/dataset/Datasets_12338_zip/7705127">Monash figshare outlier detection datasets</a> This repository has 12,000-plus anomaly detection datasets that can be downloaded as a zip file. All these datasets are prepared using classification datasets from the UCI repository. This data repository was used in the paper <em>On normalization and algorithm selection for unsupervised outlier detection</em>.</p></li>
<li><p><a href="https://elki-project.github.io/datasets/outlier">ELKI outlier detection datasets</a> This repository was used in the paper <em>On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study</em> by Campos et al. It contains over 2000 datasets prepared for anomaly detection.</p></li>
<li><p><a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OPQMVF">Harvard dataverse anomaly detection datasets</a> This respository contains 10 datasets prepared for anomaly detection. It was used in the paper <em>A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data</em> by Goldstein and Uchida.</p></li>
<li><p><a href="https://towardsdatascience.com/adrepository-anomaly-detection-datasets-with-real-anomalies-2ee218f76292">ADRepository - Anomaly Detection Datasets with Real Anomalies</a> This is a GitHub repository maintained by Guansong Pang. It contains 21 datasets. More details are available in the paper <em>Deep learning for anomaly detection: a review</em> by Pang et al.</p></li>
<li><p><a href="https://ir.library.oregonstate.edu/concern/datasets/47429f155">Anomaly Detection Meta-Analysis Benchmarks</a> This repository was used in the paper <em>A Meta-Analysis of the Anomaly Detection Problem</em> by Emmott et al. </p></li>
</ol>
<p>If I haven’t listed your anomaly detection data repository, do let me know and I will include it.</p>
Asha Rao's and Sophie Calabretto's two-part podcast on Mathematics Education and Imposter Syndrome
https://sevvandi.netlify.com/podcasts/2020-12-16-asha-sophies-podcast/
Tue, 15 Dec 2020 00:00:00 +0000https://sevvandi.netlify.com/podcasts/2020-12-16-asha-sophies-podcast/
<script src="https://sevvandi.netlify.com/podcasts/2020-12-16-asha-sophies-podcast/index_files/header-attrs/header-attrs.js"></script>
<p><img src="https://sevvandi.netlify.com/img/Asha_and_Sophies_podcast.jpg" />
In this two-part podcast I interviewed <a href="https://www.rmit.edu.au/contact/staff-contacts/academic-staff/r/rao-professor-asha">Prof Asha Rao</a> from RMIT University and <a href="https://sophiecalabretto.github.io/">Dr Sophie Calabretto</a> from Macquarie University on Mathematics Education and Women in Mathematics. This is an ACEMS Random Sample podcast and the first part is available at <a href="https://acems.org.au/podcast/episode-43-maths-perseverance" class="uri">https://acems.org.au/podcast/episode-43-maths-perseverance</a>.</p>
<p>It was a great chat on why maths is needed, how it can help us in our careers and how we can go about learning mathematics. Asha and Sophie gave different viewpoints cementing the importance of a good mathematics education.</p>
<p>The second part is available at <a href="https://acems.org.au/podcast/episode-44-overcoming-imposter-syndrome" class="uri">https://acems.org.au/podcast/episode-44-overcoming-imposter-syndrome</a>.
In this part we discussed the imposter syndrome and how it affects women. More importantly, we talked about how to increase female participation in mathematics.</p>
Algorithm Evaluation using Item Response Theory
https://sevvandi.netlify.com/talk/2020_austms/
Fri, 11 Dec 2020 15:30:00 +0000https://sevvandi.netlify.com/talk/2020_austms/
composits
https://sevvandi.netlify.com/software/composits/
Thu, 10 Dec 2020 00:00:00 +0000https://sevvandi.netlify.com/software/composits/
Galit Shmueli's podcast on tech giants hacking our brains
https://sevvandi.netlify.com/podcasts/2020-10-14-galits-podcast/
Wed, 14 Oct 2020 00:00:00 +0000https://sevvandi.netlify.com/podcasts/2020-10-14-galits-podcast/
<script src="https://sevvandi.netlify.com/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p><img src="https://sevvandi.netlify.com/img/Galits_podcast.jpg" />
<a href="https://acems.org.au/our-people/tim-macuga">Tim Macuga</a> from ACEMS and I talked to <a href="https://www.galitshmueli.com/">Prof Galit Shmueli</a> from National Tsing Hua University, Taiwan via zoom and made a podcast. This is available at <a href="https://acems.org.au/podcast/episode-34-galit-shmueli" class="uri">https://acems.org.au/podcast/episode-34-galit-shmueli</a>.</p>
<p>Galit talked about different aspects of statistics and how it is used. She talked about different viewpoints, for example in social sciences and STEM fields.</p>
<p>Galit also talked about the possibility of big tech giants influencing us in ways that we do not think. By using their systems, they have the power to motivate us to do what they want us to do. Whether they do it or not is a different issue.</p>
Outliers in Compositional Time Series Data
https://sevvandi.netlify.com/preprint/2020-09-09_outliers_in_composit/
Wed, 09 Sep 2020 00:00:00 +0000https://sevvandi.netlify.com/preprint/2020-09-09_outliers_in_composit/
Which nonlinear dimension reduction methods preserve outliers inside a sphere?
https://sevvandi.netlify.com/post/2020-08-29-nonlinear-dimension-reduction-and-outlier-detection/
Sat, 29 Aug 2020 00:00:00 +0000https://sevvandi.netlify.com/post/2020-08-29-nonlinear-dimension-reduction-and-outlier-detection/
<script src="https://sevvandi.netlify.com/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>In this post we will look at how well nonlinear dimension reduction techniques preserve outliers that are placed inside a sphere, when all the other data points are on the surface of the sphere. We will use the R package <strong>dimRed</strong> for this analysis. First we note that linear projection-based methods on the original data will not work, because projections will hide the outliers inside the sphere. Also, none of the nonlinear dimension reduction methods we consider are especially designed for outlier detection. So, it is not a limitation of the method if they do not preserve outliers. But, we want to see if any of them do.</p>
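<p>The claim about linear projections can be checked with a quick base R sketch: place points on a sphere, add an outlier at the centre, and project with PCA. The sample size and the parametrisation below are illustrative choices.</p>

```r
# PCA cannot separate an outlier at the centre of a sphere: the origin
# projects into the middle of the projected disc, among ordinary points
set.seed(1)
n <- 1000
theta <- runif(n, -pi, pi)
phi <- runif(n, 0, 2*pi)
sphere <- cbind(cos(theta)*cos(phi), cos(theta)*sin(phi), sin(theta))
X <- rbind(sphere, c(0, 0, 0))   # last row is the inside outlier

proj <- prcomp(X)$x[, 1:2]
mean(sqrt(rowSums(proj[1:n, ]^2)))  # typical projected radius of sphere points
sqrt(sum(proj[n + 1, ]^2))          # outlier's projected radius: near zero
```

The outlier sits close to the centre of the projected point cloud, so no two-dimensional linear view separates it.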
<p>Let’s plot the data first.</p>
<pre class="r"><code>set.seed(1)
theta <- seq(from = -1*pi, to = pi, by=0.01)
phi <- seq(from = 0, to= pi*2, by=0.01)
theta1 <- sample(theta, size=5*length(theta), replace=TRUE)
phi1 <- sample(phi, size=5*length(phi), replace=TRUE)
x <- cos(theta1)*cos(phi1)
y <- cos(theta1)*sin(phi1)
z <- sin(theta1)
df <- cbind.data.frame(x,y,z)
df1 <- df
dim(df)</code></pre>
<pre><code>## [1] 3145 3</code></pre>
<pre class="r"><code>oo <- matrix(c(0,0,0,0,0.2,-0.1), nrow=2, byrow = TRUE)
colnames(oo) <- colnames(df)
df <- rbind.data.frame(df, oo)
plot(df[ ,c(1,2)], pch=20)
points(df[3146:3147, ], pch=20, col=c("red", "green"), cex=2)</code></pre>
<p><img src="https://sevvandi.netlify.com/post/2020-08-29-nonlinear-dimension-reduction-and-outlier-detection/index_files/figure-html/dataset-1.png" width="672" /></p>
<pre class="r"><code>dd2 <- dimRedData(df)</code></pre>
<p>The dataframe contains 3145 points on the surface of the sphere. These points are non-outliers. The data points in the dataframe on rows 3146 and 3147 are the outliers. We plot the outliers in red and green.</p>
<p>Most methods need a parameter such as <span class="math inline">\(k\)</span> in KNN distances. For all methods we fix <span class="math inline">\(k=10\)</span> and map the original data from <span class="math inline">\(\mathbb{R}^3\)</span> to <span class="math inline">\(\mathbb{R}^2\)</span>. It is also called a 2-dimensional embedding. Let’s start our analysis with IsoMap.</p>
<div id="isomap" class="section level2">
<h2>IsoMap</h2>
<pre class="r"><code>emb2 <- embed(dd2, "Isomap", .mute = NULL, knn = 10)</code></pre>
<pre><code>## 2020-08-30 23:17:23: Isomap START</code></pre>
<pre><code>## 2020-08-30 23:17:23: constructing knn graph</code></pre>
<pre><code>## 2020-08-30 23:17:23: calculating geodesic distances</code></pre>
<pre><code>## 2020-08-30 23:17:27: Classical Scaling</code></pre>
<pre class="r"><code>embdat <- as.data.frame(emb2@data)
plot(embdat, pch=20,main="IsoMap embedding")
points(embdat[3146:3147, ], pch=20, col=c("red", "green"), cex=2)</code></pre>
<p><img src="https://sevvandi.netlify.com/post/2020-08-29-nonlinear-dimension-reduction-and-outlier-detection/index_files/figure-html/isomap-1.png" width="672" /></p>
<p>Well, <span class="math inline">\(k=10\)</span> did not bring out the outliers inside the sphere with IsoMap. Next, let’s look at Locally Linear Embedding (LLE).</p>
</div>
<div id="lle" class="section level2">
<h2>LLE</h2>
<pre class="r"><code>emb2 <- embed(dd2, "LLE", knn = 10)</code></pre>
<pre><code>## finding neighbours
## calculating weights
## computing coordinates</code></pre>
<pre class="r"><code>embdat <- as.data.frame(emb2@data)
plot(embdat, pch=20, main="LLE embedding")
points(embdat[3146:3147, ], pch=20, col=c("red", "green"), cex=2)</code></pre>
<p><img src="https://sevvandi.netlify.com/post/2020-08-29-nonlinear-dimension-reduction-and-outlier-detection/index_files/figure-html/LLEfun-1.png" width="672" /></p>
<p>LLE does not bring out the outliers either. Next we look at Laplacian Eigenmaps.</p>
</div>
<div id="laplacian-eigenmaps" class="section level2">
<h2>Laplacian Eigenmaps</h2>
<pre class="r"><code>emb2 <- embed(dd2, "LaplacianEigenmaps", knn = 10)</code></pre>
<pre><code>## 2020-08-30 23:00:51: Creating weight matrix</code></pre>
<pre><code>## 2020-08-30 23:00:52: Eigenvalue decomposition</code></pre>
<pre><code>## Eigenvalues: 5.098969e-03 3.829450e-03 4.537894e-17</code></pre>
<pre><code>## 2020-08-30 23:00:53: DONE</code></pre>
<pre class="r"><code>embdat <- as.data.frame(emb2@data)
plot(embdat, pch=20, main="Laplacian eigenmaps embedding")
points(embdat[3146:3147, ], pch=20, col=c("red", "green"), cex=2)</code></pre>
<p><img src="https://sevvandi.netlify.com/post/2020-08-29-nonlinear-dimension-reduction-and-outlier-detection/index_files/figure-html/laplace-1.png" width="672" />
This is nice. Laplacian eigenmaps really brought out the outliers using <span class="math inline">\(k=10\)</span>. Next we look at diffusion maps.</p>
</div>
<div id="diffusion-maps" class="section level2">
<h2>Diffusion Maps</h2>
<pre class="r"><code>emb2 <- embed(dd2, "DiffusionMaps")</code></pre>
<pre><code>## Performing eigendecomposition
## Computing Diffusion Coordinates
## Elapsed time: 4.67 seconds</code></pre>
<pre class="r"><code>embdat <- as.data.frame(emb2@data)
plot(embdat, pch=20, main="Diffusion maps embedding")
points(embdat[3146:3147, ], pch=20, col=c("red", "green"), cex=2)</code></pre>
<p><img src="https://sevvandi.netlify.com/post/2020-08-29-nonlinear-dimension-reduction-and-outlier-detection/index_files/figure-html/diffusion-1.png" width="672" /></p>
<p>Diffusion maps also brought out the outliers. Interestingly, everything else is mapped to a line. Next we consider non-metric dimensional scaling.</p>
</div>
<div id="non-metric-dimensional-scaling" class="section level2">
<h2>Non-Metric Dimensional Scaling</h2>
<pre class="r"><code>emb2 <- embed(dd2, "nMDS", d = function(x) exp(dist(x)))
embdat <- as.data.frame(emb2@data)
plot(embdat, pch=20, main="non-MDS embedding")
points(embdat[3146:3147, ], pch=20, col=c("red", "green"), cex=2)</code></pre>
<p><img src="https://sevvandi.netlify.com/post/2020-08-29-nonlinear-dimension-reduction-and-outlier-detection/index_files/figure-html/nonmetric-1.png" width="672" /></p>
<p>Finally, we try t-SNE.</p>
</div>
<div id="tsne" class="section level2">
<h2>t-SNE</h2>
<pre class="r"><code>emb2 <- embed(dd2, "tSNE", perplexity = 10)
embdat <- as.data.frame(emb2@data)
plot(embdat, pch=20, main="tsne embedding")
points(embdat[3146:3147, ], pch=20, col=c("red", "green"), cex=2)</code></pre>
<p><img src="https://sevvandi.netlify.com/post/2020-08-29-nonlinear-dimension-reduction-and-outlier-detection/index_files/figure-html/tsne-1.png" width="672" /></p>
<p>Of the methods explored, Laplacian eigenmaps brought out the outliers while showing some spherical structure in the embedding. Diffusion maps also brought out the outliers. However, the spherical structure of the data was lost in the embedding.</p>
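<p>Instead of eyeballing the plots, the embeddings can be scored. Below is a sketch using a plain base R mean-kNN-distance score (a stand-in for a package score such as those in DDoutlier) on a toy embedding shaped like the Laplacian eigenmaps result: a ring with the two outliers pulled towards the centre.</p>

```r
# Score each embedded point by its mean distance to its k nearest neighbours;
# low ranks for the outlier rows mean the embedding kept them detectable
set.seed(1)
ang <- runif(500, 0, 2*pi)
embdat <- rbind(cbind(cos(ang), sin(ang)),       # ring, rows 1-500
                rbind(c(0, 0), c(0.1, -0.05)))   # "outliers", rows 501-502

knn_score <- function(X, k = 10) {
  D <- as.matrix(dist(X))
  apply(D, 1, function(d) mean(sort(d)[2:(k + 1)]))  # drop the self-distance
}
rank(-knn_score(embdat))[501:502]   # both should rank as the most outlying
```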
</div>
Bushfire Alert! Branches on Powerlines
https://sevvandi.netlify.com/talk/2020_bernoulli-ims_one_world_symposium/
Sun, 09 Aug 2020 16:10:00 +0000https://sevvandi.netlify.com/talk/2020_bernoulli-ims_one_world_symposium/
Anomaly detection dilemmas
https://sevvandi.netlify.com/post/2020-07-19-anomaly-detection-datasets/
Sun, 19 Jul 2020 00:00:00 +0000https://sevvandi.netlify.com/post/2020-07-19-anomaly-detection-datasets/
<script src="https://sevvandi.netlify.com/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>Finding anomalies/outliers in data is a task that is increasingly getting more attention mainly due to the variety of applications involved. Not that it is a new field of research. Rather, outlier/anomaly detection has been studied by statisticians and computer scientists for a long time. However, there are certain aspects which lack consensus.</p>
<ol style="list-style-type: decimal">
<li><strong>What’s in a name?</strong></li>
</ol>
<p>The words outlier/anomaly/novelty/extremes are sometimes used to describe the same thing; sometimes for slightly different things. So, when we read a paper, we need to be careful what the word means, because it may not mean what we think it means.</p>
<ol start="2" style="list-style-type: decimal">
<li><strong>Lack of definitions</strong></li>
</ol>
<p>Antony Unwin in his paper <em>Multivariate outliers and the O3 plot</em> says <em>“Outliers are a complicated business. It is difficult to define what they are, it is difficult to identify them, and it is difficult to assess how they affect analyses”.</em> Sometimes we find that the definition of an outlier depends on the application. For example, chromosomal anomalies in tumours may have very different distributional properties from fraudulent credit card transactions among billions of legitimate transactions. This makes it difficult, if not impossible, for researchers to come up with algorithms that detect anomalies in all situations. Often, parameter selection plays a role. For example, if we are using an anomaly detection method based on k-nearest neighbour (knn) distances, we are frequently faced with the question “which k works best?”.</p>
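<p>The “which k works best?” question can be made concrete with a small sketch: sweep k for a kNN-distance score on a toy dataset with planted outliers and watch the AUC change. The mean-kNN-distance score and the rank-based AUC below are plain base R stand-ins for package implementations, and the values of k are arbitrary.</p>

```r
# Sweep k for a kNN-distance outlier score and track the AUC for each k
set.seed(1)
X <- rbind(matrix(rnorm(1000), ncol = 2),                    # 500 inliers
           matrix(rnorm(10, mean = 3, sd = 0.2), ncol = 2))  # 5 outliers
labs <- c(rep(0, 500), rep(1, 5))

knn_score <- function(X, k) {
  D <- as.matrix(dist(X))
  apply(D, 1, function(d) mean(sort(d)[2:(k + 1)]))  # drop the self-distance
}
auc <- function(scores, labs) {   # rank-based AUC: P(outlier score > inlier score)
  r <- rank(scores)
  n1 <- sum(labs); n0 <- sum(labs == 0)
  (sum(r[labs == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
ks <- c(1, 5, 10, 50, 200)
aucs <- sapply(ks, function(k) auc(knn_score(X, k), labs))
round(aucs, 3)
```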
<ol start="3" style="list-style-type: decimal">
<li><strong>Identify outliers or give scores?</strong></li>
</ol>
<p>Researchers broadly detect outliers in two different ways. (1) Identify outliers, i.e. declare whether a data point is an outlier. This is a binary declaration. (2) Give a score of outlyingness. With this method, each point in the dataset gets a score of outlyingness. Both ways have pros and cons. The binary identification of outliers does not tell you how outlying a point is; we have no sense of which point is the most outlying among the identified outliers. On the other hand, the scoring method does not tell you which points are actually outliers. That is left to the user, who can define a threshold and declare points with scores above that threshold to be outliers. However, coming up with a meaningful threshold may not be an easy task.</p>
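<p>The thresholding difficulty is easy to demonstrate with a sketch; the boxplot rule used below is one common but arbitrary choice of threshold, and the scores are made up for illustration.</p>

```r
# Convert outlier scores to a binary identification with a threshold
set.seed(1)
scores <- c(rexp(500), 8, 9, 12)   # toy scores; last three points planted high
cut <- quantile(scores, 0.75) + 1.5 * IQR(scores)   # boxplot rule
outliers <- which(scores > cut)
# The three planted points are flagged, but so are some ordinary tail
# points: the threshold choice decides how many false alarms we accept
length(outliers)
```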
<ol start="4" style="list-style-type: decimal">
<li><strong>Evaluation methods</strong></li>
</ol>
<p>Again, there is no consensus on how to evaluate anomaly detection methods. This is not surprising given that outlier detection methods have two modes of operation: identifying outliers or giving scores. Methods that identify outliers are evaluated using metrics such as the true positive rate, false positive rate, positive predictive value and negative predictive value. On the other hand, anomaly detection methods that give scores use the area under the Receiver Operating Characteristic (ROC) curve, or the area under the Precision-Recall curve, to evaluate effectiveness.</p>
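<p>As a small illustration of the first mode of evaluation, the confusion-matrix metrics can be computed directly from labels and binary predictions; the toy labels and predictions below are made up for illustration.</p>

```r
# Confusion-matrix metrics for a binary outlier identification
labs <- c(rep(0, 95), rep(1, 5))              # ground truth: 5 outliers
pred <- c(rep(0, 93), 1, 1, 0, 1, 1, 1, 1)    # a detector's binary output

tp <- sum(pred == 1 & labs == 1)   # outliers correctly flagged (4)
fp <- sum(pred == 1 & labs == 0)   # inliers wrongly flagged (2)
fn <- sum(pred == 0 & labs == 1)   # outliers missed (1)
tn <- sum(pred == 0 & labs == 0)   # inliers correctly passed (93)

# TPR 0.8, FPR ~0.021, Precision ~0.667 for this toy detector
c(TPR = tp / (tp + fn), FPR = fp / (fp + tn), Precision = tp / (tp + fp))
```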
<ol start="5" style="list-style-type: decimal">
<li><strong>Datasets, or the lack thereof</strong></li>
</ol>
<p>I believe it is fair to say that until recently there were no benchmark datasets for anomaly detection, or at least little was known about such datasets. In the recent past there have been some attempts to fill this gap, and now there are several repositories of datasets specially prepared for anomaly detection.</p>
Early classification of spatio-temporal events using partial information
https://sevvandi.netlify.com/publication/2019-07-14_early_classification/
Tue, 14 Jul 2020 00:00:00 +0000https://sevvandi.netlify.com/publication/2019-07-14_early_classification/
Gael Martin's podcast on Bayesian Statistics
https://sevvandi.netlify.com/podcasts/2020-06-30-gaels-podcast/
Tue, 30 Jun 2020 00:00:00 +0000https://sevvandi.netlify.com/podcasts/2020-06-30-gaels-podcast/
<script src="https://sevvandi.netlify.com/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p><img src="https://sevvandi.netlify.com/img/Gael_pic.jpg" />
During the 2020 working-from-home period <a href="https://acems.org.au/our-people/tim-macuga">Tim Macuga</a> from ACEMS and I interviewed <a href="http://users.monash.edu.au/~gmartin/">Prof Gael Martin</a> from Monash University via zoom and made a podcast. This is available at <a href="https://acems.org.au/podcast/episode-32-bayes-theorem" class="uri">https://acems.org.au/podcast/episode-32-bayes-theorem</a>.</p>
<p>Gael gave a brief history of the Bayes Theorem and explained how it became so powerful a tool in today’s computational world. She gave examples and explained some theoretical stuff too. Pretty cool!</p>
<p>In this episode she also explains the differences between Bayesian and Frequentist paradigms and discusses how Bayesian computation developed over the years.</p>
Rob Hyndman's podcast on forecasting
https://sevvandi.netlify.com/podcasts/2020-05-30-robs-podcast/
Sat, 30 May 2020 00:00:00 +0000https://sevvandi.netlify.com/podcasts/2020-05-30-robs-podcast/
<script src="https://sevvandi.netlify.com/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>Are you interested in forecasting? Well, <a href="https://acems.org.au/our-people/anthony-mays">Anthony Mays</a> and I did a podcast with <a href="https://robjhyndman.com/">Prof Rob Hyndman</a> from Monash University on this topic. This episode is a part of the ACEMS podcast series <em>Random Sample</em> and can be found at <a href="https://acems.org.au/podcast/episode-29-forecasting-the-future" class="uri">https://acems.org.au/podcast/episode-29-forecasting-the-future</a></p>
<p>In this episode, Rob explains time series forecasting really well. He gives such interesting anecdotes from his vast experience. I really enjoyed conducting this podcast, so much so that I forgot to take a photo. That is my only regret.</p>
Comprehensive Algorithm Portfolio Evaluation using Item Response Theory
https://sevvandi.netlify.com/preprint/2020-05-05_evaluating_algorithm/
Tue, 05 May 2020 00:00:00 +0000https://sevvandi.netlify.com/preprint/2020-05-05_evaluating_algorithm/
Patricia Menéndez chats about Antarctica
https://sevvandi.netlify.com/podcasts/2020-04-30-patricia-s-podcast/
Thu, 30 Apr 2020 00:00:00 +0000https://sevvandi.netlify.com/podcasts/2020-04-30-patricia-s-podcast/
<script src="https://sevvandi.netlify.com/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p><img src="https://sevvandi.netlify.com/img/Patricia_Podcast.jpg" />
Before the 2020 self-isolation period (also known as the lockdown) <a href="https://acems.org.au/our-people/tim-macuga">Tim Macuga</a> from ACEMS and I informally interviewed <a href="https://www.patriciamenendez.com/">Dr Patricia Menéndez</a> from Monash University and made a podcast. Even though it was an informal chat, it was pretty special because it was about Patricia’s <a href="https://homewardboundprojects.com.au/about/">Homeward Bound</a> trip to Antarctica.</p>
<p>I felt she took us there with her evocative expressions. This podcast is part of the ACEMS podcast series <em>Random Sample</em> and can be found at <a href="https://acems.org.au/podcast/episode-24-Antarctic-Outreach" class="uri">https://acems.org.au/podcast/episode-24-Antarctic-Outreach</a> where there are lovely pictures of Antarctica. Don’t miss it.</p>
Tips for increasing your happiness during self-isolation
https://sevvandi.netlify.com/post/2020-04-13-tips-for-increasing-your-happiness/
Mon, 13 Apr 2020 00:00:00 +0000https://sevvandi.netlify.com/post/2020-04-13-tips-for-increasing-your-happiness/
<script src="https://sevvandi.netlify.com/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>Teaching first-year students made me realize how difficult it is for young people to stay at home all the time, especially during this COVID-19 crisis. My understanding is that it makes people frustrated and unhappy. How can we support them? Are there tips that can help? Here is a list of things that I think will help to increase your sense of well-being. Give it a try! <img src="featured.JPG" alt="Alt text" /></p>
Singularities of axially symmetric volume preserving mean curvature flow
https://sevvandi.netlify.com/publication/2020-01-01_singularities_of_axi/
Sun, 01 Mar 2020 00:00:00 +0000https://sevvandi.netlify.com/publication/2020-01-01_singularities_of_axi/airt
https://sevvandi.netlify.com/software/airt/
Sun, 05 Jan 2020 21:00:00 +0000https://sevvandi.netlify.com/software/airt/On normalization and algorithm selection for unsupervised outlier detection
https://sevvandi.netlify.com/publication/2020-01-01_on_normalization_and/
Wed, 01 Jan 2020 00:00:00 +0000https://sevvandi.netlify.com/publication/2020-01-01_on_normalization_and/DOBIN - Dimension Reduction for Outlier Detection
https://sevvandi.netlify.com/talk/2019_wombat/
Thu, 28 Nov 2019 15:30:00 +0000https://sevvandi.netlify.com/talk/2019_wombat/eventstream
https://sevvandi.netlify.com/software/eventstream/
Sun, 24 Nov 2019 00:00:00 +0000https://sevvandi.netlify.com/software/eventstream/Using dobin for time series data
https://sevvandi.netlify.com/post/dobin-for-time-series/
Sat, 16 Nov 2019 00:00:00 +0000https://sevvandi.netlify.com/post/dobin-for-time-series/
<p>The R package <em>dobin</em> can be used as a dimension reduction tool for outlier detection. So, if we have a dataset of <span class="math inline">\(N\)</span> independent observations, where each observation is of dimension <span class="math inline">\(p\)</span>, <em>dobin</em> can be used to find a new basis, such that the outliers of this dataset are highlighted using fewer basis vectors (see <a href="https://sevvandi.github.io/dobin/index.html">here</a>).</p>
<p>But how do we use <em>dobin</em> for time series data? <em>Dobin</em> is not meant for raw time series data, because the observations are time-dependent rather than independent. But we can break a time series into consecutive non-overlapping windows and compute features of the data in each window using an R package such as <a href="https://pkg.robjhyndman.com/tsfeatures/"><em>tsfeatures</em></a>. If we compute <span class="math inline">\(d\)</span> features, then the data in each time series window will be denoted by a point in <span class="math inline">\(\mathbb{R}^d\)</span>.</p>
<div id="a-synthetic-example" class="section level2">
<h2>A Synthetic Example</h2>
<p>Let’s look at an example. We make a normally distributed time series of length <span class="math inline">\(6000\)</span> and insert an outlier at the position <span class="math inline">\(1010\)</span>.</p>
<pre class="r"><code>library(tsfeatures)
library(dplyr)
library(dobin)
library(ggplot2)
set.seed(1)
# Generate 6000 random normally distributed points for a time series
y <- rnorm(6000)
# Insert an additive outlier at position 1010
y[1010] <- 6
df <- cbind.data.frame(1:6000, y)
colnames(df) <- c("Index", "Value")
ggplot(df, aes(Index, Value)) + geom_point() + theme_bw()</code></pre>
<p><img src="https://sevvandi.netlify.com/post/2019-11-16-using-dobin-for-time-series/index_files/figure-html/setup-1.png" width="672" /></p>
<p>Now, let us break the time series into non-overlapping chunks of length <span class="math inline">\(50\)</span>, i.e. we get <span class="math inline">\(120\)</span> chunks or windows. Why do we use non-overlapping windows? If we use overlapping windows, say sliding by <span class="math inline">\(1\)</span>, the outlying point in the time series contributes to <span class="math inline">\(50\)</span> windows. Later, when we compute features of these time series windows, these <span class="math inline">\(50\)</span> windows will have similar features, but they will not be outliers in the feature space, because there are <span class="math inline">\(50\)</span> of them. That is why we use non-overlapping windows.</p>
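<p>This dilution argument is easy to check with a quick count. The following is a minimal sketch, using the window length of <span class="math inline">\(50\)</span> and the outlier position <span class="math inline">\(1010\)</span> from above, to count how many windows of each type contain the outlier.</p>
<pre class="r"><code>w <- 50
# Sliding windows (step 1): how many contain position 1010?
starts_sliding <- 1:(6000 - w + 1)
sum(starts_sliding <= 1010 & starts_sliding + w - 1 >= 1010)
# Non-overlapping windows: how many contain position 1010?
starts_fixed <- seq(1, 6000, by = w)
sum(starts_fixed <= 1010 & starts_fixed + w - 1 >= 1010)</code></pre>
<pre><code>## [1] 50
## [1] 1</code></pre>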
<p>Also, note that each window needs a decent length for feature computation. For each window we compute time series features using <em>tsfeatures</em>.</p>
<pre class="r"><code># Split the time series into windows of length 50
my_data_list <- split(y, rep(1:120, each = 50))
# Compute features of each chunk using tsfeatures
ftrs <- tsfeatures(my_data_list)
head(ftrs)</code></pre>
<pre><code>## # A tibble: 6 x 16
## frequency nperiods seasonal_period trend spike linearity curvature
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 1 0.0506 1.01e-3 0.354 0.212
## 2 1 0 1 0.110 6.68e-4 -0.500 0.0679
## 3 1 0 1 0.201 8.10e-4 -2.18 -0.836
## 4 1 0 1 0.129 5.11e-4 -0.402 -1.57
## 5 1 0 1 0.134 7.74e-4 -0.817 1.39
## 6 1 0 1 0.0673 1.06e-3 0.130 0.681
## # ... with 9 more variables: e_acf1 <dbl>, e_acf10 <dbl>, entropy <dbl>,
## # x_acf1 <dbl>, x_acf10 <dbl>, diff1_acf1 <dbl>, diff1_acf10 <dbl>,
## # diff2_acf1 <dbl>, diff2_acf10 <dbl></code></pre>
<p>It is easier to find a good set of basis vectors that highlight outliers when there are many more points than dimensions, i.e. <span class="math inline">\(N > p\)</span>. In this case the feature space is <span class="math inline">\(16\)</span>-dimensional, and we have <span class="math inline">\(120\)</span> points, each corresponding to a window of the time series.</p>
<p>Next we input these time series features to <em>dobin</em>.</p>
<pre class="r"><code>ftrs %>% dobin(norm=2) -> out
coords <- as.data.frame(out$coords[ ,1:2])
colnames(coords) <- c("DC1", "DC2")
ggplot(coords, aes(DC1, DC2)) + geom_point() + theme_bw()</code></pre>
<p><img src="https://sevvandi.netlify.com/post/2019-11-16-using-dobin-for-time-series/index_files/figure-html/dobin-1.png" width="672" /> In the first and second dobin component space (DC1-DC2 space), we see a point appearing far away near <span class="math inline">\((15, -5)\)</span>. Let’s investigate this point.</p>
<pre class="r"><code>inds <- which(coords[ ,1] > 10)
inds</code></pre>
<pre><code>## [1] 21</code></pre>
<p>OK, this point comes from window 21, and it deviates along the DC1 axis. So, let us look at the first dobin vector.</p>
<pre class="r"><code># First dobin vector
out$vec[ ,1]</code></pre>
<pre><code>## [1] 0.00000000 0.00000000 0.00000000 0.12507580 0.91723338
## [6] 0.10686900 0.12483596 0.08128369 0.20790487 -0.08597682
## [11] 0.06804500 0.17399103 0.05037166 0.08260081 -0.06594736
## [16] 0.10098625</code></pre>
<pre class="r"><code>colnames(ftrs)</code></pre>
<pre><code>## [1] "frequency" "nperiods" "seasonal_period"
## [4] "trend" "spike" "linearity"
## [7] "curvature" "e_acf1" "e_acf10"
## [10] "entropy" "x_acf1" "x_acf10"
## [13] "diff1_acf1" "diff1_acf10" "diff2_acf1"
## [16] "diff2_acf10"</code></pre>
<p>The first vector has a high value in <strong>spike</strong> (0.9172334), which measures the amount of spikiness in a time series. Now, let’s have a look at the 21st window of the time series.</p>
<pre class="r"><code># Make a dataframe from window 21
df2 <- cbind.data.frame((1000 + 1:50), my_data_list[[inds]])
colnames(df2) <- c("Index", "Value")
ggplot(df2, aes(Index, Value)) + geom_point() + geom_line() + theme_bw()</code></pre>
<p><img src="https://sevvandi.netlify.com/post/2019-11-16-using-dobin-for-time-series/index_files/figure-html/analysis3-1.png" width="672" /> We see that we’ve picked up the spike at position <span class="math inline">\(1010\)</span> in the 21st window: since <span class="math inline">\(1010/50 = 20.2\)</span>, position 1010 falls in window 21.</p>
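<p>As a quick check of that arithmetic, we can map a time index directly to its non-overlapping window:</p>
<pre class="r"><code># Which non-overlapping window of length 50 contains time index 1010?
ceiling(1010 / 50)</code></pre>
<pre><code>## [1] 21</code></pre>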
</div>
<div id="a-real-example" class="section level2">
<h2>A Real Example</h2>
<p>Next we look at a real-world example: the streamflow of the Mad River near Springfield, Ohio, from 1915 to 1960.</p>
<pre class="r"><code>library(fpp)
library(ggplot2)
library(tsfeatures)
library(dobin)
library(tsdl)
tt <- tsdl[[77]]
autoplot(tt) + ggtitle("Mad River near Springfield OH 1915 - 1960") +
xlab("Year") + ylab("Streamflow")</code></pre>
<p><img src="https://sevvandi.netlify.com/post/2019-11-16-using-dobin-for-time-series/index_files/figure-html/realEx-1.png" width="672" /></p>
<p>Let’s split the time series into non-overlapping windows and compute features as before.</p>
<pre class="r"><code>my_data_list <- split(tt, rep(1:23, each = 24))
# Compute features of each chunk using tsfeatures
ftrs <- tsfeatures(my_data_list)
ftrs[ ,4:7] %>% dobin() -> out
coords <- as.data.frame(out$coords[ ,1:2])
colnames(coords) <- c("DC1", "DC2")
ggplot(coords, aes(DC1, DC2)) + geom_point(size=2) + theme_bw()</code></pre>
<p><img src="https://sevvandi.netlify.com/post/2019-11-16-using-dobin-for-time-series/index_files/figure-html/feat2-1.png" width="672" /> We see a point having a DC1 value greater than 1. Let us investigate that point.</p>
<pre class="r"><code>ind <- which(coords[ ,1] > 1)
ind</code></pre>
<pre><code>## [1] 12</code></pre>
<pre class="r"><code>df <- cbind.data.frame((11*24+1):(12*24), my_data_list[[ind]])
colnames(df) <- c("Index", "Streamflow")
ggplot(df, aes(Index, Streamflow)) + geom_point() + geom_line()</code></pre>
<p><img src="https://sevvandi.netlify.com/post/2019-11-16-using-dobin-for-time-series/index_files/figure-html/dobin2-1.png" width="672" /></p>
<p>We see this point corresponds to the window with the highest spike in the time series, as this is the only spike greater than 75 units.</p>
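<p>As a sanity check, we can locate the window containing the maximum streamflow directly. This is a small sketch assuming the series <code>tt</code> and the window length of 24 used above.</p>
<pre class="r"><code># Window (length 24) containing the largest observation
ceiling(which.max(tt) / 24)</code></pre>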
<p>So, in summary, <em>dobin</em> can be used as a dimension reduction technique for outlier detection in time series data, as long as the data is prepared appropriately.</p>
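<p>In code, the whole recipe condenses to a few lines. This is a minimal sketch with a hypothetical helper name, assuming a numeric series <code>y</code> whose length is a multiple of the window length <code>w</code>.</p>
<pre class="r"><code>library(tsfeatures)
library(dobin)

# Window the series, featurise each window, then run dobin on the features
dobin_ts <- function(y, w) {
  windows <- split(y, rep(seq_len(length(y) / w), each = w))
  ftrs <- tsfeatures(windows)        # one feature vector per window
  out <- dobin(ftrs)                 # dobin in the feature space
  as.data.frame(out$coords[ , 1:2])  # first two dobin components
}</code></pre>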
</div>
Space junk podcast
https://sevvandi.netlify.com/podcasts/space-junk/
Sun, 10 Nov 2019 00:00:00 +0000https://sevvandi.netlify.com/podcasts/space-junk/<p>We humans leave such a lot of junk in space. If you’re interested in space junk, you can listen to two podcasts that I co-hosted with <a href="https://acems.org.au/our-people/anthony-mays" target="_blank">Anthony Mays</a> on the topic. These episodes were part of the ACEMS podcast series <em>Random Sample</em> and can be found at <a href="https://acems.org.au/podcast/episodes12-13-space-junk-shield-tech" target="_blank">https://acems.org.au/podcast/episodes12-13-space-junk-shield-tech</a>.</p>
<p>In the first episode, we talk about space junk and one of the related research papers that I wrote with <a href="https://www.dst.defence.gov.au/staff/dr-shannon-ryan" target="_blank">Dr Shannon Ryan</a> from DST Group and <a href="https://katesmithmiles.wixsite.com/home" target="_blank">Prof Kate Smith-Miles</a>.</p>
<p>In the second episode we interview Shannon and <a href="https://people.mst.edu/faculty/wschon/" target="_blank">Prof Bill Schonberg</a>, two space junk experts, about spacecraft shields and the latest technology.</p>
dobin
https://sevvandi.netlify.com/software/dobin/
Sat, 02 Nov 2019 14:42:00 +0000https://sevvandi.netlify.com/software/dobin/Early event classification in spatio-temporal data streams
https://sevvandi.netlify.com/talk/2019_isf/
Tue, 18 Jun 2019 14:30:00 +0000https://sevvandi.netlify.com/talk/2019_isf/Instance space analysis for outlier detection
https://sevvandi.netlify.com/talk/2019_edml/
Sat, 04 May 2019 16:00:00 +0000https://sevvandi.netlify.com/talk/2019_edml/A framework for automated anomaly detection in high frequency water-quality data from in situ sensors
https://sevvandi.netlify.com/publication/2019-01-01_a_framework_for_auto/
Tue, 01 Jan 2019 00:00:00 +0000https://sevvandi.netlify.com/publication/2019-01-01_a_framework_for_auto/Anomaly detection in streaming nonstationary temporal data
https://sevvandi.netlify.com/publication/2019-01-01_anomaly_detection_in/
Tue, 01 Jan 2019 00:00:00 +0000https://sevvandi.netlify.com/publication/2019-01-01_anomaly_detection_in/Event detection in spatio-temporal data using a Bayesian segmented ARMA change-point model
https://sevvandi.netlify.com/preprint/2019-01-01_event_detection_in_s/
Tue, 01 Jan 2019 00:00:00 +0000https://sevvandi.netlify.com/preprint/2019-01-01_event_detection_in_s/Instance Space Analysis for Unsupervised Outlier Detection
https://sevvandi.netlify.com/publication/2019-01-01_instance_space_analy/
Tue, 01 Jan 2019 00:00:00 +0000https://sevvandi.netlify.com/publication/2019-01-01_instance_space_analy/Predicting sediment and nutrient concentrations from high-frequency water-quality data
https://sevvandi.netlify.com/publication/2019-01-01_predicting_sediment_/
Tue, 01 Jan 2019 00:00:00 +0000https://sevvandi.netlify.com/publication/2019-01-01_predicting_sediment_/Singularity formation in axially symmetric mean curvature flow with Neumann boundary
https://sevvandi.netlify.com/preprint/2019-01-01_singularity_formatio/
Tue, 01 Jan 2019 00:00:00 +0000https://sevvandi.netlify.com/preprint/2019-01-01_singularity_formatio/Does normalizing your data affect outlier detection?
https://sevvandi.netlify.com/talk/2018_user/
Wed, 11 Jul 2018 16:10:00 +0000https://sevvandi.netlify.com/talk/2018_user/On the extension of axially symmetric volume preserving mean curvature flow
https://sevvandi.netlify.com/preprint/2017-01-01_on_the_extension_of_/
Sun, 01 Jan 2017 00:00:00 +0000https://sevvandi.netlify.com/preprint/2017-01-01_on_the_extension_of_/Machine learning methods for predicting the outcome of hypervelocity impact events
https://sevvandi.netlify.com/publication/2016-01-01_machine_learning_met/
Fri, 01 Jan 2016 00:00:00 +0000https://sevvandi.netlify.com/publication/2016-01-01_machine_learning_met/Support vector machines for characterizing Whipple shield performance
https://sevvandi.netlify.com/publication/2015-01-01_support_vector_machi/
Thu, 01 Jan 2015 00:00:00 +0000https://sevvandi.netlify.com/publication/2015-01-01_support_vector_machi/On the convergence of axially symmetric volume preserving mean curvature flow
https://sevvandi.netlify.com/publication/2012-01-01_on_the_convergence_o/
Sun, 01 Jan 2012 00:00:00 +0000https://sevvandi.netlify.com/publication/2012-01-01_on_the_convergence_o/