I'd also say corrupting gene names, but in this case the scientists literally decided to rename the genes so they wouldn't conflict with Excel's treatment of date-like strings.
Interestingly, genetic biologists are probably more aware of this problem than most. When you import a CSV containing gene names such as SEPT2 or MARCH1, Excel automatically converts them to dates. This has potentially had a fairly large effect on research in the area [1]. It's one of the many reasons we insist on using only Ensembl IDs for genes at my company.
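To make the SEPT2/MARCH1 case concrete, here is a minimal Python sketch that flags symbols likely to be read as dates before a file ever gets opened in Excel. The pattern is my own approximation (month name or abbreviation followed by one or two digits), not Excel's actual parsing rules, and `at_risk_symbols` is a made-up helper name.

```python
import re

# Assumption: a month name or abbreviation followed by 1-2 digits is "date-like".
# This only approximates Excel's behaviour; it is not its real rule set.
_MONTHS = "JAN|FEB|MAR|MARCH|APR|APRIL|MAY|JUN|JUNE|JUL|JULY|AUG|SEP|SEPT|OCT|NOV|DEC"
_DATE_LIKE = re.compile(rf"^(?:{_MONTHS})-?\d{{1,2}}$", re.IGNORECASE)


def at_risk_symbols(symbols):
    """Return the symbols that look like dates under the assumed pattern."""
    return [s for s in symbols if _DATE_LIKE.match(s.strip())]


if __name__ == "__main__":
    genes = ["SEPT2", "MARCH1", "SHH", "DICER1", "ENSG00000168385"]
    print(at_risk_symbols(genes))
```

Identifiers like Ensembl IDs never match a pattern of this shape, which is presumably part of why they survive a round trip through a spreadsheet.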
One other problem not mentioned is that biologists have to name genes in such a way that they won't be autocorrected or reformatted by Excel, especially the ones Excel thinks are supposed to be dates [0].
The article is so annoying! I am pretty damn sure Excel doesn't intentionally have any gene-renaming features, or even any concept of a gene. So what are the scientists doing? Forgetting to put quotes around data that are gene names, and having them interpreted as function names or dates or something?
Yeah, why there is no "scientific" mode to just turn off all the "smartness" has baffled me for years and will probably continue to baffle me for years to come.
Or even an algorithm that can detect that you are using gene names from the cells around march1 and sept7.
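For what it's worth, a crude version of that "look at the neighbouring cells" idea fits in a few lines of Python. This is only a sketch of a heuristic I made up, not anything Excel does: the `GENE_LIKE` pattern, the 0.8 threshold, and the function name are all assumptions, and the pattern would happily accept plenty of ordinary words too.

```python
import re

# Assumed shape of a gene symbol: a letter followed by letters, digits or hyphens.
GENE_LIKE = re.compile(r"^[A-Za-z][A-Za-z0-9-]{1,14}$")
# Assumed shape of a value a spreadsheet might read as a date.
DATE_LIKE = re.compile(
    r"^(?:JAN|FEB|MAR|MARCH|APR|MAY|JUN|JUL|AUG|SEP|SEPT|OCT|NOV|DEC)-?\d{1,2}$",
    re.IGNORECASE,
)


def column_looks_like_genes(values, threshold=0.8):
    """Guess whether a column holds gene symbols by looking at the cells
    that are NOT date-like, i.e. the neighbours of march1 and sept7."""
    cells = [v.strip() for v in values if v.strip()]
    neighbours = [v for v in cells if not DATE_LIKE.match(v)]
    if not neighbours:
        return False  # nothing to judge the context by
    hits = sum(1 for v in neighbours if GENE_LIKE.match(v))
    return hits / len(neighbours) >= threshold


if __name__ == "__main__":
    print(column_looks_like_genes(["TP53", "BRCA1", "march1", "SHH", "sept7"]))
    print(column_looks_like_genes(["2021-03-01", "12/05/2020", "march1"]))
```

If the column looks like genes, the date-like cells would be left as text; otherwise the usual date conversion could go ahead.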
There's a gene called Septin-6, abbreviated Sept6. You can tell when gene expression data has been through an Excel cut/paste cycle, because Sept6 has been converted to September 6. Oh Excel, you think you're so smart.
At the university I was working on a project to parse gene relationships from the literature. And yeah, I remember the inconsistencies. Also, genes have funny names: there is SHH (Sonic Hedgehog), DICER1 (which cuts something, RNA or DNA, I forget which), and a bunch of other silly ones.
Ultimately, though, coming from the world of algorithms and nicely organized data, it was frustrating how disorganized the nomenclature seemed.
There's an interesting recent paper, also cited by these folks, about errors in gene expression data. Spreadsheets (or at least Excel) are a major culprit, interpreting the gene SEP15 as September 15, for example. This is dangerous when it happens in only part of the data, such as some of the columns.
Gene name errors are widespread in the scientific literature
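Partial mangling is the nasty case, so it can help to scan a gene column for values that already look like converted dates and report where they sit. A rough Python sketch, assuming the mangled cells show up in a day-month form such as "15-Sep" (other locales and formats would need other patterns; `mangled_rows` is a hypothetical helper):

```python
import re

# Assumption: converted cells look like "15-Sep" or "1-Mar".
# Other display formats (e.g. "September 15") are not covered here.
MANGLED = re.compile(
    r"^\d{1,2}-(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)


def mangled_rows(column):
    """Return the 0-based positions of cells that look like converted dates."""
    return [i for i, v in enumerate(column) if MANGLED.match(v.strip())]


if __name__ == "__main__":
    column = ["TP53", "15-Sep", "SHH", "1-Mar"]
    print(mangled_rows(column))
```

A column where only some positions come back is exactly the "only part of the data" situation, and those rows are the ones to check against the original source.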
Don't forget that the genomics folks also renamed a whole bunch of genes a few years ago so now there are two different names for the same thing floating around!
https://pubmed.ncbi.nlm.nih.gov/27552985/ estimates that about one fifth of papers with supplementary Excel lists of genes contain mangled gene names. I remember talking about this problem back in 2003. The HGNC has been quietly going around changing the names of some of these genes to try and stop this from being a problem.