UvA-DARE (Digital Academic Repository) Detecting dispersed radio transients in real time using convolutional neural networks Detecting dispersed radio transients in real time using convolutional neural networks ✩

We present a methodology for automated real-time analysis of a radio image data stream with the goal to find transient sources. Contrary to previous works, the transients we are interested in occur on a time-scale where dispersion starts to play a role, so we must search a higher-dimensional data space and yet work fast enough to keep up with the data stream in real time. The approach consists of five main steps: quality control, source detection, association, flux measurement, and physical parameter inference. We present parallelized methods based on convolutions and filters that can be accelerated on a GPU, allowing the pipeline to run in real-time. In the parameter inference step, we apply a convolutional neural network to dynamic spectra that were obtained from the preceding steps. It infers physical parameters, among which the dispersion measure of the transient candidate. Based on critical values of these parameters, an alert can be sent out and data will be saved for further investigation. Experimentally, the pipeline is applied to simulated data and images from AARTFAAC (Amsterdam Astron Radio Transients Facility And Analysis Centre), a transients facility based on the Low-Frequency Array (LOFAR). Results on simulated data show the efficacy of the pipeline, and from real data it discovered dispersed pulses. The current work targets transients on time scales that are longer than the fast transients of beam-formed search, but shorter than slow transients in which dispersion matters less. This fills a methodological gap that is relevant for the upcoming Square-Kilometer Array (SKA). Additionally, since real-time analysis can be performed, only data with promising detections can be saved to disk, providing a solution to the big-data problem that modern astronomy is dealing with.

Much work has been done on detecting fast radio transients (Cordes and McLaughlin, 2003;Lorimer et al., 2013;Coenen et al., 2014;Amiri et al., 2018) that occur on millisecond time scales. This is usually realized using beamforming (Lorimer and Kramer, 2012), having high time and frequency resolution at the cost of ✩ This code is registered at the ASCL with the code entry ascl:2021.064. poor spatial resolution. The arguably foremost example of such fast transients is the fast radio burst (Petroff et al., 2019). One of the key features of fast radio transients is the dispersion of the observed emission in time and frequency, in which emission at lower radio frequencies arrive later in time than the emission at higher radio frequencies (Taylor and Cordes, 1993). Each source has a characteristic Dispersion Measure (DM); more distant sources have a higher DM value corresponding to a larger delay between receiving the high and low frequency emission.
In between these two extremes, we enter a domain where we investigate images of a large field of view at a relatively high time resolution. The high time-resolution means that the transients (of intermediate length, lasting seconds to minutes) can be significantly dispersed, especially at low radio frequencies. Consequently, we also need sufficient frequency resolution. Altogether, analysis is to be performed in the spatial, temporal, and frequency x-axis. The instrument (AARTFAAC) observed at two separate ranges of consecutive bands, resulting in the empty central gap. The parametersθ inferred by the convolutional neural network are provided above the respective samples. Additionally, we plot the dispersion sweep according to the inferred dispersion measure DM. The upper-right example is the proposed transient candidate by Kuiack et al. (2020b). The other three are yet to be confirmed new transient candidates.
domains. This is a higher-dimensional search space than the previously introduced methods. On the one hand beamformed search only considers the time and frequency domains. On the other hand, slow transient search considers mostly the spatial and time domains. Kuiack et al. (2020a,c) show such intermediate length transient detections in this type of data. These were achieved in offline analysis, causing a long latency between a transient's occurrence and its discovery. If we can find them in real time, we enable follow-up studies before they have faded. This greatly enhances the scientific return, making a strong case for further development of structured search methods. We note, of course, that imaging data usually have much lower time and frequency resolution than beam-formed data, so while we will search a higher-dimensional data space, we do not necessarily search a larger data volume.
At every time step, we obtain an image cube containing a frequency dimension and two spatial dimensions. Analysis pipelines for these have been proposed before by e.g. Swinbank et al. (2015). However, these works are not yet scalable enough to perform blind transient searches in the enormous volumes of data. Additionally, there have been targeted searches for specific types of highly dispersed sources in high time and frequency resolution data obtained using the Murchison Widefield Array (MWA; Tingay et al., 2013). Tingay et al. (2015) piloted a search for dispersed fast radio bursts in high time and frequency resolution imaging data covering a 400 square degree field of view obtained using the MWA. However, processing just 2 h of data required 3 days on a single processing core, making this computationally inefficient and far from attaining real-time analysis. Searches using the MWA have also focused on highly dispersed sources expected to be detected in just a single pixel in the radio image, enabling a significant speed up in processing time but at the sacrifice of not searching the full field of view (e.g. Anderson et al., 2021). Thus, there is a need for a computationally efficient method to search for dispersed radio transients in wide field of view imaging observations. We thereby can blindly search for the brightest and rarest transients, filtering only the most useful information and alerting the multi-wavelength transient community when required. This means that the enormous number of spurious candidates from radio frequency interference (RFI), scintillating sources, and random noise have to be filtered automatically.
As an answer to the aforementioned, we develop a pipeline that considers the spatial, time, and frequency domain simultaneously and can detect dispersed transients of intermediate length.
It scales to real-time analysis by its ability to run on GPUs. Five sequential steps are performed: quality control, source detection, source association, flux measurement, and physical parameter inference. Motivated before, we want to keep a high-frequency resolution and therefore do not integrate the frequency bands. This allows for new processing approaches. We can control the quality of the images (step 1) by comparing them to the other bands. Next, it allows for source detection (step 2) in sub-bands independently. Furthermore, we measure the source flux (step 4) in separate bands. We propose methods for source detection and measurement using convolutions and filters, which are easily parallelizable. Since we perform the processing on the subbands individually, we can directly construct dynamic spectra from which we infer physical parameters of a potential transient candidate (step 5). We apply a convolutional neural network machine learning approach to do so. Particularly the DM is of interest in separating spurious from real transients. By doing so, we rethink the challenge of detecting dispersed transients by using dispersion directly to discard spurious candidates.
We test our approaches on simulated and real data from AART-FAAC (Amsterdam Astron Radio Transients Facility And Analysis Centre; Prasad et al., 2016), a real-time transients facility based on the Low-Frequency Array (LOFAR; van Haarlem et al., 2013). However, our methods could also be implemented by MWA, Long Wavelength Array Station 1 (LWA1; Obenberger et al., 2014), Owens Valley Radio Observatory Long Wavelength Array (OVRO-LWA; Anderson et al., 2019) LOFAR, and the future Square-Kilometer Array (SKA; Carilli and Rawlings, 2004). The results show that the detection methods can reliably find transient candidates, and the neural network discriminates spurious candidates from promising ones using reliable uncertainty bounds. As shown in Fig. 1, interesting bursts are uncovered, among which the candidate proposed by Kuiack et al. (2020b). Moreover, the pipeline can perform these steps in real time, allowing for online selection of data to be saved to disk for follow-up investigation.
Our scientific contributions can be summarized as follows: • We propose an end-to-end GPU-accelerated pipeline that can take streaming multi-frequency image data and output alerts in real time. It contains source detection, tracking (i.e., association) and analysis. 1 Our method is the first that considers the spatial, time, and frequency domains of the incoming data simultaneously. This allows for transient hunting on intermediate time scales.
• We propose source detection and measurement methods based on convolutions and filters.
• A neural-network-based analysis approach, in which physical parameters are inferred directly from dynamic spectra.
The paper is organized as follows. In Section 2 we describe the end-to-end pipeline and its methodologies. In Section 3 we report results of experiments that were done to test the pipeline and some preliminary data products extracted from application to observations. Finally, we conclude in Section 4 and foresee some interesting directions for further research and development.

Methodology
An overview of our transients pipeline is given in Fig. 2. In this section, we discuss some of the approaches. Since it is somewhat specific to our instrument, the methods for quality control of the input data can be found in Appendix A.

Source detection
The input at time t to our pipeline is an image cube X t ∈ R B×D×D , where B is the number of channels and D the image size. We analyze images at multiple bandpasses in parallel. This is done since astronomical transients are expected to be dispersed over these bandpasses. By doing so, the probability of a false negative (FN) goes down with B as where Φ is the standard cumulative density function, κ userdefined and s/n the signal-to-noise ratio of the source. When perfectly correcting for dispersion, one usually gets a factor √ B (times s/n) decrease. However, applying brute force coherent dedispersion before searching is practically unattainable in real time in image space for all sources. Still, Eq. (1) is always better than not or even wrongly using the bandpasses. Aggregating the bandpasses (as in e.g. Anderson et al., 2019) without correcting for dispersion would increase the probability of a false negative with √ B. The above is illustrated in Fig. 3. Derivations and further discussion are presented in Appendix B. Furthermore, we exemplify the consequences of this result in Appendix E, where we show that for several facilities (MWA, LOFAR, LWA1, OVRO-LWA), integration yields a lower likelihood of finding a dispersed transient.

Peak detection
Consider a single image X ∈ R D×D (time-index omitted) from X t . Previous approaches (e.g. Swinbank et al., 2015;Spreeuw et al., 2018) divide the image to be analyzed into a grid. In every grid cell, sigma-clipping is performed and thus local statistics False-negative rate as a function of the number of bandpasses B (lower is better). As a reference, we show the log-probability for a single pass. When not or incorrectly adjusting for dispersion, the probability of a false negative goes up. For the given signal-to-noise-ratio s/n and κ, the optimal case (but practically unattainable) is averaging after correctly adjusting for dispersion. Independently searching also gives significant false negative error improvement as B grows.
are used for the detection threshold. We replace this approach with a method that combines sigma-clipping with convolutions. When dividing the image into a grid, an isolated faint source might not be detected when it is at the edge of a grid patch that is crowded with bright sources (or radio-frequency interference). The bright objects first have to be clipped away for the source to be found. This is usually countered (naively) by interpolating the grid. Alternatively, one might want to sigmaclip using overlapping grid cells, lowering the risk that a faint source at an unfortunate position might not be picked up on. However, taken to the extreme case this becomes a convolution where we use grid statistics at every pixel coordinate. By the convolution theorem, this can effectively be implemented using Fourier transforms and accelerated on GPUs. Let be a Gaussian kernel with Z := ∑ n+k ] its normalizing constant. Alternatively, one could also go for a (circular) uniform kernel. s is an important parameter and should be set such that the local background statistics are sufficiently included, without incorporating too much source flux. Heuristics for setting it are e.g. widening the Gaussian such that its full width at half maximum (FWHM) includes most primary sidelobes, or running an optimization scheme such that it retrieves as many as possible known sources from a reference catalog. The following is run for multiple iterations. At each iteration, we have and where the square is applied element-wise. C nm and S nm are center and spread estimates of the noise computed at every location nm.
We clip values in the following manner: with κ as a parameter that specifies how much signal is required for a detection to be made. This procedure is repeated until no new y nm is discovered or for a pre-specified number of iterations. Since we have accurate noise estimates at every nm, not many iterations are needed to find most of the true positives (Appendix C). The result of the iterative sigma-clipping done is a binary image Y ∈ {0, 1} D×D indicating the source locations.

Peak localization
The sources in Y usually are extended (i.e., they span multiple connected locations). In order to isolate them to a single coordinate, we apply a maximum filter to X and compute Q ∈ {0, 1} D×D indicating if the pixels are the local maximum: with k = 3. Then, we obtain the peak locations P ∈ {0, 1} D×D by the following boolean operation: Intuitively, we only include pixels that are both the local maximum and above the detection threshold, giving us the exact location of the peaks. The advantage of doing so (next to parallelization) is that this automatically deblends detected sources. I.e., it separates an ''island'' of flux that exceeds the local noise level into distinct sources. Obtaining P for all B bandpasses independently gives us a cube P ∈ {0, 1} B×D×D with source locations.
This concludes the source detection & localization pipeline. We have presented new methods based on convolutions and kernels. They are elegant in that they do not require a discontinuous grid to be placed over the image. Additionally, they are fully parallel and accelerated on GPUs. That is, all operations are performed concurrently on separate spatial or channel locations.

Source association & flux measurement
At time-step t after we obtain a list of sources from the source detection pipeline, we filter the duplicates that we obtain from measuring at different frequencies simultaneously. Next, we match the catalog's previous time-step t −1 sources based on the 2D distance (in degrees). Sources are matched if they are within a pre-specified association distance limit (e.g., 1 degrees). After initial detection, we keep taking measurements (i.e., monitor the source) until it has not been detected for at least a prespecified number of time-steps. This is to obtain data products that are not too sparse for analysis. We concatenate the detected sources with the sources that are to be monitored. Next, we take measurements of the fluxes of detected and monitored sources at all frequencies. This is done by taking the maximum pixel value within a box around the source peak. In the interest of time, we are not fitting and integrating Gaussians to the detected sources. We add the measured peak flux to the database for every monitored or newly detected source. Finally, we backward fill the catalog with ''null detections'' (newly detected sources). That is, if some potential transient is detected in time-step t, we also want to include the ''build-up'' into the data products. Thus, we add a pre-specified number of timesteps to the catalog and measure from cached images.

Neural network-based parameter inference
Our quality control and source detection pipelines already filter out spurious candidates based on (local) noise statistics. However, in radio astronomy there exist many noise modes (e.g., RFI, satellites, airplanes) that will slip through as candidate transients. These signals are not dispersed due to their nearness. This is in contrast with astronomical transients, giving us a key feature to filter them. This was originally seen by Kuiack et al. (2020b), who heuristically searched for dispersed candidates. We take a more systematic approach and resort to machine learning. Neural networks in particular can process many data instances in parallel, especially on GPUs (LeCun et al., 2015). In this subsection, we elaborate on how we use inferred physical parameters to filter the spurious transient candidates. Afterward, we give an introduction to deep learning and convolutional neural networks (LeCun et al., 1995). Finally, we explain how these are used to obtain physical parameters.

Filtering candidates based on physical parameters
An astrophysical burst can be described using a range of physical parameters. For radio transients, these are among others overall shape, integrated and peak flux density, pulse width, dispersion measure, spectral information (such as index and bandwidth), and scattering. By obtaining these, we can quickly apply a filter and narrow follow-up investigation to only the most interesting cases. For example, any noise progenitor that is relatively close (e.g., atmospheric or human-made) will not be dispersed since the integrated electron density along the path is too small. If we can quickly obtain the dispersion measure we can exclude low-DM bursts. Next, we give an introduction to deep learning and how it is used to infer these parameters of interest.

Deep learning
We briefly discuss the set-up of the neural network For a more thorough explanation of how the neural network can be used on frequency-time plots, we refer the reader to e.g. Connor and van Leeuwen (2018). In its essence, a neural network is a regression preceded by a series of nonlinear transformations. The idea is that the network learns a mapping from input space to another space (''hidden state'') from which it completes a task (e.g., classification or regression). The projection into the hidden space allows a neural network to learn a representation. This representation reflects important features that were computed from the input data. Let x ∈ R d be an arbitrary data-point, w ∈ R d a vector of weights, b ∈ R a bias value. A hidden state is computed using a linear combination of the input with a set of weights, and a nonlinearity (referred to as ''activation function'') φ(·): In this work we use Leaky ReLU functions: with α a small positive value. Concatenating multiple such layers with a final task-specific layer forms a neural network. Since we are dealing with a regression problem here (inferring parameters from the dynamic spectra) our final layer is simply another linear combination but without an activation function. Convolutional neural networks (LeCun et al., 1995) are a type of neural network where the weight vector comes in the form of a filter (or kernel) with which the input image is convolved. This weight filter is then learned, extracting relevant features from the image. This can, like a traditional feed-forward network, be optimized with regular gradient descent and backpropagation (LeCun et al., 2015). The fact that only kernels (typically multiple per layer) are learned, makes this a lightweight network that is particularly effective for image processing. Applying the kernels in a convolution also has the advantage that the network output is translationally invariant. This is an important property since we apply the network to dynamic spectrums in which the burst can occur at multiple spatial locations.

Parameter inference
Let θ ∈ R d ′ be a parameter vector associated with a data-point x. In our case, x is a dynamic spectrum that we obtained from the image processing steps (see Fig. 2). We are interested in p(θ | x). Since θ is real-valued, a natural choice is to model p be the output of the neural network g(x, W), where W is a concatenation of the layer weights.L is a lower-triangular matrix with positive diagonal. Therefore, it is a Cholesky factor and Σ :=LL ⊤ obtains a valid covariance matrix. The neural network outputs thus parameterize p(θ | x). By outputting a variance, the network directly models the signal-to-noise ratio of a data point. This allows for filtering data points with too low signal-to-noise levels. W is obtained by training the network, which is done as follows. We have a dataset D = {(x k , θ k )} K k=1 of K independent data-points. We obtain the optimal network weights for parameterizing the conditional distribution p(θ k | x k , W) (i.e., we amortize the usage of W for parameterizing all K conditionals) using maximum likelihood.
with loss function where c is a constant that does not depend on W. Using minibatch gradient descent (LeCun et al., 2015), we iteratively adapt W to minimize this loss function. We use the Adam optimizer (Kingma and Ba, 2014) with default settings (e.g., learning rate) to We report the 1-sigma bands of 32 runs. Additionally, we supply F 90 scores (lower is better) for both. do so. Thereby, the neural network ''learns'' to predict the correct θ from an input x.
Consider Fig. 4 for an illustration fo the proposed methods.

Results
We conduct several experiments to assess the efficacy of our methods. The following are run on an Intel Xeon Gold 5118 with a maximum clock speed of 3.20 GHz and an NVIDIA Titan RTX GPU accelerator. Table 1 Quantitative results of the neural network prediction of the DM parameter. The column values are fluence bins. We report the mean absolute error (MAE), root-mean-square error (RMSE; lower is better) absolutely and relative to the dispersion measure. Additionally, we give the probabilities that the true value lies within 1, 2 or 3 times the predicted standard deviation.

Accuracy of source finding
To test the source-finder, we carry out experiments using single-bandpass all-sky images. The images are simulated. This has the advantage that we control entirely the noise characteristics and SNR of the sources. First, we generate a model that simulates large-scale inhomogeneities (extended emission) of the final image. For example, our Galactic foreground noise and Cygnus A & Cassiopeia A calibration remnants. Then, we add to this map standard, independent Gaussian noise. Following Vafaei Sadr et al. (2019), we sample flux values from an exponential distribution such that approximately 40% have a signal-to-noise ratio less than unity. We create point spread functions (PSF) by adding artificial side-lobes to a Gaussian main beam. More details and examples are shown in Appendix D.
We also take inspiration from (Vafaei Sadr et al., 2019) evaluating the source-finder. The reported signal-to-noise ratios (SNRs) (x-axis) are computed by dividing the ground-truth peak flux value over the image noise. We bin them and report the scores as a function of these bins. The scores are precision (purity), recall (completeness), and F1.
where TP and FP are the number of true positives and false positives, respectively. The F1 score computes a harmonic mean between precision (P) and recall (R). This is important as usually there is a trade-off between them. We also denote what we call the F 90 which is defined as the minimal signal to noise ratio (SNR) such that the F1 score is at least 0.9. We run the methods at κ = 2, which optimized the reported F 90 . For our method (coined ''LPF'' for Live Pulse Finder) we ran only a single sigmaclip iteration since this already recovers most of the true positives (see Appendix C). In practice, we run more sigma-clip iterations as this maximizes recall. In Fig. 5, we compare our results with PySE (Spreeuw et al., 2018) at various signal-to-noise ratios. At similar recall values, our method is a bit more precise at low signal-to-noise ratios but slightly less precise with high signalto-noise ratios. In terms of F1 scores and the methods perform similarly with an F 90 of 4.65. Note that PySE takes around 2 s to complete the analysis for one image (single band) on our system (see Section 3.5).

Parameter inference
As specified in , we need a dataset of input-output pairs to train our neural network. Since we have few true transient  candidates, we are very short on such data. Favorably, we know one of the most important properties of a true astrophysical transient: a bright, broadband, dispersed signal. The dispersion of the signal is manifested through the broadening of the pulse over a finite bandwidth. It originates from the interaction of emitted photons with electrons along the path between an observer and the source. The integrated electron column density, called the DM, is used as a proxy for the distance to the source. Using known equations, we can easily simulate dispersed astrophysical transients. The dispersion measure of non-astrophysical transients usually is close to 0 (i.e., no broadening of the pulse), making it possibly the most useful feature for separating signal from noise (Kuiack et al., 2020c). We build a dataset by injecting these simulated pulses into randomly sampled noise from the survey. In the following, the reported parameter values can be set according to the interests of the practitioner. They should be set such that the simulated dataset covers the population of interest sufficiently. First, we generate a Gaussian pulse profile using a width w sampled uniformly between 0 and 16, corresponding to a maximum FWHM of 37.67 time-steps.
The profile is computed as with C DM the dispersion constant e 2 pc 8π 2 ϵ 0 m e c · 10 −6 ≈ 4148.806 MHz 2 cm −3 ms (Lorimer and Kramer, 2012). In this analysis, we sample the DM uniformly between 0 and 512, covering a significant part of the known FRB population (Petroff et al., 2016). The final intensity is computed as where A is the burst amplitude in standard deviations of the noise. We experimented with A uniformly between 0 and 8 and α, the spectral index, sampled from (−4, 4) uniformly. We compute the intensity for all time-steps and frequencies and save the pulse at the corresponding band and time indices. Using this procedure, we create 32,768 samples (input time-frequency plots and output parameters pairs) to train the network. We train the convolutional neural network (CNN) on the generated dataset. The network was trained using early stopping (Le-Cun et al., 2015). That is, training was not canceled until the likelihood of the held-out validation data stopped increasing.
Quantitative results on the dispersion measure inference (arguably, the most important feature for discriminating spurious from real bursts) are shown in Table 1. The reported MAE (mean absolute error) and RMSE (root-mean-squared error) are satisfactory considering the range of values that the dispersion measure can take, which is confirmed by observing the relative MAE/DM and RMSE/DM. The uncertainty is relatively well-calibrated. That is, it is close to what one expects for a Gaussian distribution. Notably however, the network is overconfident for low-fluence bursts.

Testing the pipeline
We now test the efficacy of the entire pipeline. We simulate entire sky images consisting of both stable sources and (dispersed) transients. We sample the signal-to-noise ratios for both the transients and the stable sources using the same exponential distribution as in Section 3.1 (i.e., as suggested by Vafaei Sadr et al. (2019)). The dispersion measure was sampled from an exponential distribution with rate parameter 1 55 , such that roughly 40% of the transients have a dispersion measure DM > 50 (Fig. 6).
This is somewhat arbitrary and partly set such that there are sufficient dispersed signals, but can be motivated as follows: First, the majority of detections are expected to be lowly dispersed (atmospheric) noise. Additionally, the integrated flux within a passband is reduced due to temporal dispersion. We, therefore, do not expect many highly dispersed pulses to be detectable at all. Finally, at the long wavelengths of AARTFAAC (Prasad et al., 2016), extremely dispersed events are so smeared out that the slope would not be inferrable from the dynamic spectrum. For inference, the same neural network is used as in Section 3.2. Note that this network was trained on uniform parameter distributions, meaning that there is a discrepancy between the training data and the data that we test the model on. This replicates the distribution shift we would also expect in real data. Thus, we first inspect how the recovered population of transient candidates matches the ground truth one. Thereafter, we assess how many of the transients are recovered (as a function of fluence and dispersion measure). First observe Fig. 7, where we plot the density of modeled standard deviation for the dispersion measures (σ DM , found on the diagonal ofΣ k ) of the sources. We clearly see a bimodal distribution. The high uncertainty modes correspond to detections that were simply noise or stable sources. After removing the high uncertainty modeσ DM > 50 (regarding them false positives), we plot the resulting dispersion measure density in Fig. 8. Even though the model is trained on uniform distributions, the recovered predictions closely follow the expected density. Note that we do not expect the network to infer parameters that are extremely far from its training distribution (e.g., DM ≫ 512) correctly. However, the recovered distributions are not very much biased to the uniform distributions that the model was trained on, which is reassuring. In Fig. 9 we show the percentage of recovered transients as a function of signal-to-noise ratio for three different dispersion measure bins. As expected, the fraction of recovered transients approaches unity with increasing signal-to-noise ratios.

Application to real data
We applied the pipeline to a real AAARTFAAC-6 survey dataset.
We used κ = 4.5 to retrieve an acceptable number of candidate sources from the source detection pipeline. ''Acceptable'' here expresses a trade-off between precision and recall. By lowering κ, many more spurious candidates will be retrieved. The neural network will filter these, but the disadvantages are two-fold: more processing has to be done and it might start force-fitting at locations where you want to remain sensitive to actual candidates. After pushing the associated time-frequency data through the network, we obtain a vector of parameters {θ l } L l=1 for L recovered transient candidates. These can be filtered based on the scientist's needs. To obtain examples of interesting bursts presented in this study, we used the following. We threshold the inferred dispersion measureμ DM at 50. Next, we sorted the candidates according to their inferred dispersion measure standard deviationσ DM and inspected them top-down. By applying the entire pipeline to a real-time live survey we retrieved interesting candidates for follow-up analysis fully automatically. Examples are shown in Fig. 1. The above is exemplary but gives an idea of how a practitioner can filter the data based on inferred physical parameters. The proposed transient (Kuiack et al., 2020b) was recovered with an estimated dispersion measure of 74.49 ± 8.26 (Kuiack et al. (2020b) concluded a dispersion measure of 74 ± 5).

Scaling results
Finally, we report performance results as scalability was one of the research goals of this work. In Fig. 10 we see that we can process images in a fraction of the time that Swinbank et al. (2015) take. Importantly, we can process 16 × 1024 × 1024 cubes in real time, which is a science goal of this work. Also comparably, Pintaldi et al. (2021) report 924 images processed in 13 h. We see that the processing time as a function of the number of pixels scales sub-linearly. There is little overhead added by our convolutional source localization: the bulk of compute is required for catalog processing and image caching. Specifically, backward filling (see Section 2.2) takes 37% of these steps, and image caching takes 60%, increasing with image size. Smart I/O can solve this but is left for future work. Finally, inferring directly parameters using a neural network adds little overhead.

Conclusion
We presented new methods that allow for real-time analysis of all-sky radio image cubes on time scales where dispersion starts to play a role. In it, parallelized methods based on convolutions and filters that are accelerated on a GPU process the image stream. Afterward, a neural network is employed to infer physical parameters, among which the dispersion measure of the detected sources. Based on these, false positives are easily filtered. The methods were tested individually and as a whole on simulated data, as well as on real data from AARTFAAC, a LOFAR based transients facility. The results can competently recover simulated dispersed transients from the simulated data stream. In real AARTFAAC data dispersed signals were found, on which follow-up analysis can be performed. Scaling results showed that the entire pipeline can analyze image cubes upwards of 16 1024 × 1024 images in under 1 second per iteration. Thus, applying the method in real time can filter uninteresting data, providing a solution to the big-data problem that modern astronomy is dealing with. Concluding, the current work proposed effective and efficient methods to search for intermediate-length dispersed transients in radio image cubes, filling a methodological gap that is also relevant for MWA, LOFAR, LWA1, OVRO-LWA and the future SKA.

Future work
Based on the blind detections we performed so far, it is clear that just searching for broad-band dispersed signals is not enough. Scintillation seems to produce a significant fraction of the dispersed signals we are looking for, and we should find new ways to sift the distribution of these candidates from truly interesting events. Some of the other parameters (e.g. the width of the pulse) seem promising. Future research could further investigate.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.  has to produce a center and scale of the data to compare to. The telescope or imaging pipeline corruptions can last for extended periods and thus using temporal (moving) averages is futile, as the noisy data will be incorporated into the statistics. Since we analyze data at multiple passbands, we can instead use the prior knowledge that these corruptions are frequentially local and thus use the other passbands to form center and scale estimates. The corruptions usually are harsh, therefore we use robust statistics. Consider an image cube at time t (we omit the time index) X ∈ R B×D×D . We take the average value of the band images in a vector b := 1 ∑ D j=1 X :,i,j . We compute standardized robust scores where MAD is the Mean Absolute Deviation (e.g. Howell, 2005). A component z b of z that exceeds a threshold (of e.g. 5) results in the corresponding image X b being discarded. Discarding is done by imputing the entire image with zeros. The radio-frequency interference spikes illustrated in Fig. A.12 are usually short-lived but spatially and frequentially more local. Therefore, outliers cannot be detected in the averages of entire images. We, therefore, keep running estimatesX andS of the mean and standard deviation of the image stream using e.g. Welford (1962), Finch (2009. We compute pixel standardized scores and if ζ bij exceeds a threshold we discard the image X b in band b by imputing it with zeros. Note that the transients we are searching for could also exceed such a threshold. However, observing Fig. A.12 we note that the radio-frequency interference spikes are much harsher than an astronomical transient. Thus, we can set a ζ bij such that only the extreme outliers are caught.

Appendix B. Statistical analysis of source-finding
In the following, we perform a statistical analysis on how detecting sources in separate sub-bands individually improves our false negative error rates. We compare against averaging (with independent noise and constant signal) after applying perfect dedispersion. However, note that this is unattainable in practice: one would have to apply source-detection in images de-dispersed against all possible DM. This is computationally intractable. Secondly, the assumptions required are not realistic in practice. Alternatively, performing detection in the sub-bands individually still gives us a type 2 (false negative) error improvement rate while being computationally feasible.

B.1. Single subband
Consider a signal observed in a single passband. It is corrupted with Gaussian noise: with E ∼ N (0, n 2 ). We set a threshold that defines our type 1 error rate (i.e., false positives) at κn. Under type 1 error, we consider S = 0. That is, what is the probability that we conclude a detection when there is actually only noise. The probability of such a type 1 error is where Z := X /n. Next, we consider the type 2 error rate. This expresses the probability that we neglect actual sources (S = s > 0), i.e., false negatives. We have and so our defined κ and the signal-to-noise ratio of the source define our probability of neglecting it.

B.2. Perfect dedispersion
Consider a signal that we observe in multiple passbands. We average it after correcting perfectly for the dispersion delay. Assuming perfect independent Gaussianity with equal variances and equal signals in all channels, our signal-to-noise ratio improves with a factor B (the number of channels). Since we keep the false-negative rate fixed, defined by κn, our type 2 error improves. For S > 0 we have With B > 1 this is an improvement over the single-band case.

B.3. Wrong dedispersion
Often, subbands are averaged in order to improve signal-tonoise ratios. However, in the face of a dispersed signal, doing so can actually wash out the signal. In the extreme case, we only observe it in a single band. Integrating that out would be equal to consideringX which is equal to reducing the signal (or boosting the noise) with √ B.
Considering the same type 1 error κn then increases our type 2 error:

B.4. No dedispersion & independent detection
By doing detection in multiple subbands simultaneously, we are effectively ''throwing the dice'' B times. Consequently, we obtain an increased type 1 error. The probability of obtaining a false positive simply increases if you consider more trials. Thus, to make a fair comparison with the other methods we should adjust for this.
The probability of a type 1 error in this case is (for S = 0): where λ now defines the detection threshold and Z i := X i /n. To obtain the same type 1 error rate as the other methods (Eq. (B.2))  we equate them and solve for λn.
Thus, we should use λn as a function of κ in the current case to make a fair comparison. Then, the type 2 error (S = s > 0) probability is: We inspect how this behaves as a function of κ in the following section.

B.5. Comparison
Analytically relating these quantities is not trivial due to the CDFs. Instead, we switch to an empirical study by comparing the type 2 probabilities for different κ, B, and S, fixing n at unity. The results are shown in Fig. B.14. For small κ, our method performs relatively well. This is interesting if running with many type 1 errors (false positives) is not a problem (which is the case for our pipeline). However, averaging after dedispersion wins. The assumptions required usually make this approach unattainable in practice. 1. Perfect dedispersion 2. Uncorrelated noise 3. Stable signal across all bands If assumptions 1) and 2) are violated, we may actually risk amplifying the noise and end up in the case where we reduce the signal-to-noise ratio. Moreover, assumption 1) is infeasible in practice, as we would have to perform source-finding in images averaged for all possible dispersion measures. However, by running detections in separate subbands we still gain a type 2 performance increase over not using the multiple bandpasses or wrong dedispersion, which is also frequently done. Finally, by correcting for dispersion by doing a DM sweep, the probability of a type 1 error increases, which we did not account for in this analysis. We did take that into account for our method.

Appendix C. Sigma-clip iterations
We noted that since we have continuous estimates of the local statistics using convolutions, we do not require many iterations to do peak detection. This can be seen in Fig. C.15, where the performance of the source-finder is plotted as a function of the sigma-clip iteration. It is shown that after the first step, we already retrieve many more false positives than true positives.

Appendix D. Skymap simulation
In Fig. D.18 we depict a resulting simulated radio sky. To test our source-finder we simulate all-sky images and test how many sources we can recover. Here we detail the procedure. First, we sample source fluxes from an exponential distribution such that roughly 40% of the samples have SNR < 1. This follows (Vafaei Sadr et al., 2019). We sample point source locations uniformly within a circle with diameter D. Noise is sampled from a standard Gaussian. Additionally, we use a map of inhomogeneous extended emission (replicating galactic foreground and calibration remnants) such that the noise levels are not uniformly strong across the entire image (Fig. C.16). We sample random point-spread functions by augmenting a Gaussian ''backbone'' with random side-lobes that are integrated for 0.2 π rad. An example of such point-spread function is given in Fig. D.17.

Appendix E. Comparison to other telescopes
We exemplify the result we found in Appendix B. We consider a box-profile transient with DM = 400 and a duration of 1 s. It is simulated according to the same methods as described in using a dynamic spectrum with dν = 0.001 MHz and dt = 0.01 s and a signal to noise ratio of 7.5 in every such patch. After integrating (both noise and signal), the signal to noise ratio changes according to the telescope properties. Consider ∆ν and ∆t as a telescope's frequency resolution and snapshot length, respectively. We integrate the spectrum according to these time and frequency resolutions. Assuming standard noise, the variance and signal of each dν × dt pixel then adds. Using κ := 3, we find the minimum probability of not detecting this burst using our methods by Eq. (B.12). We compare this to fully integrating over the bandwidth and using the cumulative density of the resulting Gaussian. The false negative rates for a selection of instruments are shown in Table B.2. We see that for detecting dispersed signals integrating is not optimal. For higher DM, this becomes even worse, as more noise will be integrated.