Some career do-nothing-but-make-noise in my organization hired a firm to 'Do AI' on some shitty data and the outcome was basically linear regression. It turns out that you can impressive executives with linear regression if you deliver it enthusiastically enough.
You do unsupervised learning without labels with a linear regression. Interesting. What would you regress in this case? The problem is the following: you have a point cloud of data (electronic signal from arrays arranged into an irregular pattern). You know the physics that was discovered. You are looking for rare events (one in a billion or less) and you don’t know what they look like.
And you think we did not try linear regressions? This is what we used to do 20 years ago. Then we gained two orders of magnitude in signal-to-background discrimination. And since our data are not even images, off-shelf solutions mostly don’t apply. Try to process40 MHz of incoming collisions (1 MB each) within 100 nsec with a linear regression of point-cloud data. When you are done trying, try to think that maybe (maybe…) life is not as easy as bread&butter. If you succeed, come and knock at CERN’s door. Maybe we will let you in…
Not everyone knows everything so knowledge is the new oil.
I do know about linear regression even had quite some of it at university.
But I still wouldn’t be able to just implement it on some data without good couple days to weeks of figuring things out and which tools to use so I don’t implement it from scratch.
This is essentially what any relu based neural network approximately looks like (smoother variants have replaced the original ramp function). AI, even LLMs, essentially reduce to a bunch of code like
let v0 = 0
let v1 = 0.40978399*(0.616*u + 0.291*v)
let v2 = if 0 > v1 then 0 else v1
let v3 = 0
let v4 = 0.377928*(0.261*u + 0.468*v)
let v5 = if 0 > v4 then 0 else v4...
Thats a bit far. Relu does check x>0 but thats just one non-linearity in the linear/non-linear sandwich that makes up universal function approximator theorem. Its more conplex than just x>0
Multiply-accumulate, then clamp negative values to zero. Every even-numbered variable is a weighted sum plus a bias (an affine transformation), and every odd-numbered variable is the ReLU gate (max(0, x)). Layer 2 feeds on the ReLU outputs of layer 1, and the final output is a plain linear combination of the last ReLU outputs
// inputs: u, v
// --- hidden layer 1 (3 neurons) ---
let v0 = 0.616*u + 0.291*v - 0.135
let v1 = if 0 > v0 then 0 else v0
let v2 = -0.482*u + 0.735*v + 0.044
let v3 = if 0 > v2 then 0 else v2
let v4 = 0.261*u - 0.553*v + 0.310
let v5 = if 0 > v4 then 0 else v4
// --- hidden layer 2 (2 neurons) ---
let v6 = 0.410*v1 - 0.378*v3 + 0.528*v5 + 0.091
let v7 = if 0 > v6 then 0 else v6
let v8 = -0.194*v1 + 0.617*v3 - 0.291*v5 - 0.058
let v9 = if 0 > v8 then 0 else v8
// --- output layer (binary classification) ---
let v10 = 0.739*v7 - 0.415*v9 + 0.022
// sigmoid squashing v10 into the range (0, 1)
let out = 1 / (1 + exp(-v10))
The relu/if-then-else is in fact centrally important as it enables computations with complex control flow (or more exactly, conditional signal flow or gating) schemes (particularly as you add more layers).
I'm sure I've seen basic hill climbing (and other optimisation algorithms) described as AI, and then used evidence of AI solving real-world science/engineering problems.
Historically this was very much in the field of AI, which is such a massive field that saying something uses AI is about as useful as saying it uses mathematics. Since the term was first coined it's been constantly misused to refer to much more specific things.
From around when the term was first coined: "artificial intelligence research is concerned with constructing machines (usually programs for general-purpose computers) which exhibit behavior such that, if it were observed in human activity, we would deign to label the behavior 'intelligent.'" [1]
That definition moves the goalposts almost by definition, people only stopped thinking that chess demonstrated intelligence when computers started doing it.
The term artificial intelligence has always been just a buzzword designed to sell whatever it needed to. IMHO, it has no meaningful value outside of a good marketing term. John McCarthy is usually the person who is given credit for coming up with the name and he has admitted in interviews that it was just to get eyeballs for funding.
There are plenty of smart people in the "AI community" already who know it. Smugly commenting does not replace actual work. If you have real insight and can make something perform better, I guarantee you that many people will listen (I don't mean twitter influencers but the actual field). If you don't know any serious researcher in AI, I have my doubts that you have any insight to offer.
In my experiments, linear regression with extended (addition of squared values) attributes is very much competitive in accuracy terms with reported MLP accuracy.
The paper [1] referenced in your link follows the lagacy of the paper on the HIGGS dataset, and does not operate with quantities like accuracy and/or perplexity. HIGGS dataset paper provided area under ROC, from which one had to approximate accuracy. I used accuracy from the ADMM paper [2] to compare my results with. As I checked later, area under ROC in [1] mostly agrees with [2] SGD training results on HIGGS.
I think that perplexity measure is appropriate there in [1] because we need to discern between three outcomes. This calls for softmax and for perplexity as a standard measure.
So, my questions are: 1) what perplexity should I target when dealing with "mc-flavtag-ttbar-small" dataset? And 2) what is the split of train/validate/test ratio there?
For better or worse the people working on this don't really use perplexity or accuracy to evaluate models. The target is whatever you'd get for those metrics if you used the discriminants that were provided in the dataset (i.e. the GN2v01 values).
As for why accuracy and perplexity aren't reported: the experiments generally choose a threshold to consider something a "b-hadron" (basically picking a point along the ROC curve) and quantify the TPR and FPR at that point. There are reasons for this, mostly that picking a standard point lets them verify that the simulation actually reflects data. See, for example, the FPR [1] and TPR [2] "calibrations".
It's a good point, though, the physicists should probably try harder to report standard metrics that the rest of the ML community uses.
Perplexity, aka measuring how much a network is sure about its answer. Which might be wrong. It would not pass the pier review of any particle physics journal. (Real) science is about being right, not about being sure about itself.
And this problem is a joke compared to a real problem. We are talking about going from 40 MHz to 100 kHz incoming data stream, after which a second layer of real-time selection reduces the data to 1 kHz which is processed, cleaned, elaborated into high level features that you have in that dataset. But if you think you can do better, apply for a CERN job, come here and enlighten us!
How so? Half the people here have LLM delusion in every thread posted here; more than half of the things going to the frontpage are AI. Just look at hours where Americans are awake.
Fucking Americans. Only 4% of the world population, with the magic of disproportionately afflicting the global news headlines which make their way here.
One of the authors (of one of the two models, not this particular paper) here. Just a clarification, these models are *not* burned into silicon. They are trained with brutal QAT but are put onto fpgas. For axol1tl, the weights are burned in the sense that the weights are hard-wired in the fabric (i.e., shift-add instead of conventional read-muk-add cycle), but not on the raw silicon so the chip can be reprogrammed. Though, for projects like smartpixel or HG-Cal readout, there are similar ones targeting silicon (google something like "smartpixel cern", "HGCAL autoencoder" and you will find them), and I thought it was one of them when viewing the title.
Some slides with more info: https://indico.cern.ch/event/1496673/contributions/6637931/a...
The approval process for a full paper is quite lengthy in the collaboration, but a more comprehensive one is coming in the following months, if everything went smoothly.
Regarding the exact algorithm: there are a few versions of the models deployed. Before v4 (when this article was written), they are slides 9-10. The model was trained as a plain VAE that is essentially a small MLP. In inference time, the decoder was stripped and the mu^2 term from the KL div was used as the loss (contributions from terms containing sigma was found to be having negliable impact on signal efficiency). In v5 we added a VICREG block before that and used the reconstruction loss instead. Everything runs in =2 clock cycles at 40MHz clock. Since v5, hls4ml-da4ml flow (https://arxiv.org/abs/2512.01463, https://arxiv.org/abs/2507.04535) was used for putting the model on FPGAs.
For CICADA, the models was trained as a VAE again, but this time distilled with supervised loss on the anomaly score on a calibration dataset. Some slides: https://indico.global/event/8004/contributions/72149/attachm... (not up-to-date, but don't know if there other newer open ones). Both student and teacher was a conventional conv-dense models, can be found in slides 14-15.
Just sell some of my works for running qat (high-granularity quantization) and doing deployment (distributed arithmetic) of NNs in the context of such applications (i.e., FPGA deployment for <1us latency), if you are interested: https://arxiv.org/abs/2405.00645https://arxiv.org/abs/2507.04535
151 comments
https://arxiv.org/html/2411.19506v1
Why is it so hard to elaborate what AI algorithm / technique they integrate? Would have made this article much better
> I'm half expecting to see "AI model" appearing as stand-in for "linear regression" at this point in the cycle.
Already the case with consulting companies, have seen it myself
I do know about linear regression even had quite some of it at university.
But I still wouldn’t be able to just implement it on some data without good couple days to weeks of figuring things out and which tools to use so I don’t implement it from scratch.
is there something 'less good' about:
Am I the only one who stutter-parses "0 > value" vs my counterexample?Is Yoda condition somehow better?
Shouldn't we write: Let v1 = max 0 v0
From around when the term was first coined: "artificial intelligence research is concerned with constructing machines (usually programs for general-purpose computers) which exhibit behavior such that, if it were observed in human activity, we would deign to label the behavior 'intelligent.'" [1]
[1]: https://doi.org/10.1109/TIT.1963.1057864
At some point someone will realise that backpropagation and adjoint solves are the same thing.
[1] https://archive.ics.uci.edu/ml/datasets/HIGGS
In my experiments, linear regression with extended (addition of squared values) attributes is very much competitive in accuracy terms with reported MLP accuracy.
https://opendata-qa.cern.ch/record/93940
if you can beat it with linear regression we'd be happy to know.
The paper [1] referenced in your link follows the lagacy of the paper on the HIGGS dataset, and does not operate with quantities like accuracy and/or perplexity. HIGGS dataset paper provided area under ROC, from which one had to approximate accuracy. I used accuracy from the ADMM paper [2] to compare my results with. As I checked later, area under ROC in [1] mostly agrees with [2] SGD training results on HIGGS.
I think that perplexity measure is appropriate there in [1] because we need to discern between three outcomes. This calls for softmax and for perplexity as a standard measure.So, my questions are: 1) what perplexity should I target when dealing with "mc-flavtag-ttbar-small" dataset? And 2) what is the split of train/validate/test ratio there?
As for why accuracy and perplexity aren't reported: the experiments generally choose a threshold to consider something a "b-hadron" (basically picking a point along the ROC curve) and quantify the TPR and FPR at that point. There are reasons for this, mostly that picking a standard point lets them verify that the simulation actually reflects data. See, for example, the FPR [1] and TPR [2] "calibrations".
It's a good point, though, the physicists should probably try harder to report standard metrics that the rest of the ML community uses.
[1]: https://arxiv.org/pdf/2301.06319
[2]: https://arxiv.org/abs/1907.05120
It’s impressive, honestly.
Some slides with more info: https://indico.cern.ch/event/1496673/contributions/6637931/a... The approval process for a full paper is quite lengthy in the collaboration, but a more comprehensive one is coming in the following months, if everything went smoothly.
Regarding the exact algorithm: there are a few versions of the models deployed. Before v4 (when this article was written), they are slides 9-10. The model was trained as a plain VAE that is essentially a small MLP. In inference time, the decoder was stripped and the mu^2 term from the KL div was used as the loss (contributions from terms containing sigma was found to be having negliable impact on signal efficiency). In v5 we added a VICREG block before that and used the reconstruction loss instead. Everything runs in =2 clock cycles at 40MHz clock. Since v5, hls4ml-da4ml flow (https://arxiv.org/abs/2512.01463, https://arxiv.org/abs/2507.04535) was used for putting the model on FPGAs.
For CICADA, the models was trained as a VAE again, but this time distilled with supervised loss on the anomaly score on a calibration dataset. Some slides: https://indico.global/event/8004/contributions/72149/attachm... (not up-to-date, but don't know if there other newer open ones). Both student and teacher was a conventional conv-dense models, can be found in slides 14-15.
Just sell some of my works for running qat (high-granularity quantization) and doing deployment (distributed arithmetic) of NNs in the context of such applications (i.e., FPGA deployment for <1us latency), if you are interested: https://arxiv.org/abs/2405.00645 https://arxiv.org/abs/2507.04535
Happy to take any questions.
https://www.youtube.com/watch?v=8IZwhbsjhvE (From Zettabytes to a Few Precious Events: Nanosecond AI at the Large Hadron Collider by Thea Aarrestad)
Page: https://www.scylladb.com/tech-talk/from-zettabytes-to-a-few-...