
Interestingly, genetic biologists are probably more aware of this problem than most. When they import a CSV containing gene names such as SEPT2 or MARCH1, Excel automatically converts those names to dates. This has potentially had a fairly large effect on research in the area [1]. It's one of the many reasons we insist on only using Ensembl IDs for genes at my company.

[1] https://genomebiology.biomedcentral.com/articles/10.1186/s13...
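For anyone wanting to check an exported gene list for this, here is a minimal sketch in Python. It assumes the mangled values show up in the "2-Sep" / "1-Mar" display form; the actual form depends on locale and cell formatting, so the pattern is an assumption. The helper name find_date_like is made up for illustration.

```python
import re

# Assumption: a converted gene symbol is displayed as day-month,
# e.g. SEPT2 -> "2-Sep", MARCH1 -> "1-Mar". Other locales or cell
# formats will need a different pattern.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
DATE_LIKE = re.compile(rf"^(\d{{1,2}})-({MONTHS})$", re.IGNORECASE)


def find_date_like(values):
    """Return the entries that look like Excel-formatted dates.

    Hypothetical helper: it only flags suspicious values, it does not
    try to restore the original gene symbol.
    """
    return [v for v in values if DATE_LIKE.match(v.strip())]


if __name__ == "__main__":
    column = ["SEPT2", "2-Sep", "TP53", "1-Mar"]
    print(find_date_like(column))
```

This only flags candidates for manual review; a real gene column could legitimately contain other strings that happen to match.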




Just curious, but what about non-vertebrates? I'd have expected there to be an official number/hash that identifies genes, like the InChIKey for chemistry or something. IIRC, that key in particular is derived from a SHA-256 hash of the InChI string, a long human-readable "chemical formula".

We'll cross that bridge when we come to it, I guess, but we work almost exclusively with human and mouse genomes for now.

In any case, I imagine the Ensembl ID is still safer than other encodings in the case of invertebrates. For example, gene IDs in the fruit fly genome look like FBgn0034730.


FWIW, I (not a biologist, though) only use LibreOffice for importing CSV these days. It allows me to look at the fields first and tell it if I want to suppress special treatment of data in a column.

EDIT: LibreOffice also allows you to tell it what encoding a file uses and what character(s) are used as separators.
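The same idea (state the separator and encoding yourself, and suppress any special treatment of fields) can be sketched in Python with the standard csv module, which hands back every field as a plain string. The function name read_rows and the sample data are made up for illustration.

```python
import csv
import io


def read_rows(stream, delimiter=","):
    """Read delimited text, keeping every field as an uninterpreted string.

    Hypothetical helper: csv.reader performs no type conversion by
    default, so "SEPT2" stays "SEPT2".
    """
    return list(csv.reader(stream, delimiter=delimiter))


if __name__ == "__main__":
    # In-memory sample using ';' as the separator. For a real file, the
    # encoding is chosen when opening it, for example:
    #   open(path, newline="", encoding="utf-8")
    sample = io.StringIO("gene;count\nSEPT2;12\nMARCH1;7\n")
    for row in read_rows(sample, delimiter=";"):
        print(row)
```

Any numeric conversion then has to be done explicitly, column by column, which is the point.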


I noticed this in the data of some scientists I work with. Another awful thing: when you tell them they need to format the column as text before the data is put in the column (very important!) to prevent this in the future, they'll eventually try to apply it to their existing fubar spreadsheets as well. In that case the "date-recognized" genes become large numbers representing the number of days since 1900, totally unrecognizable.
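Those large numbers can at least be turned back into dates. A rough sketch, assuming the workbook uses Excel's default 1900 date system (not the 1904 one) and that the only affected symbols are the SEPT and MARCH families mentioned upthread; the month-to-prefix table and both function names are hypothetical, and the result is a guess to be checked by hand.

```python
from datetime import date, timedelta

# In the 1900 date system, adding the serial to 1899-12-30 gives the
# right date for serials >= 61 (earlier serials are shifted by Excel's
# phantom 1900-02-29).
EXCEL_EPOCH = date(1899, 12, 30)

# Assumption: only the two gene families named in the thread.
MONTH_TO_PREFIX = {3: "MARCH", 9: "SEPT"}


def serial_to_date(serial):
    """Convert an Excel day serial (1900 date system) to a date."""
    if serial < 61:
        raise ValueError("serials below 61 are not handled by this sketch")
    return EXCEL_EPOCH + timedelta(days=serial)


def guess_symbol(serial):
    """Guess the gene symbol behind a serial, or None if the month is unmapped."""
    d = serial_to_date(serial)
    prefix = MONTH_TO_PREFIX.get(d.month)
    return f"{prefix}{d.day}" if prefix else None


if __name__ == "__main__":
    print(serial_to_date(42615), guess_symbol(42615))
```

The year in the recovered date is just whatever year the file was opened in, so only the month and day carry information about the original name.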
