
Phone numbers are easy to canonicalize: convert to international form.

Email addresses can be effectively canonicalized by lowercasing. Not many mail servers are case sensitive these days. Additionally, for the local part, you can generally strip off anything after a "+", and with Gmail, you can drop any period in the local part. (Granted, it's not perfect, so make sure that's not a security concern.)
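A minimal sketch of the rules described above, for illustration only. The function name `canonicalize_email` and the `GMAIL_DOMAINS` set are hypothetical, and treating "+" as a tag separator for every domain is an assumption that does not hold for all providers. The result is meant as a lookup key, not as an address to send mail to.

```python
# Assumption: these are the domains that get Gmail's dot-insensitive handling.
GMAIL_DOMAINS = frozenset({"gmail.com", "googlemail.com"})


def canonicalize_email(address: str) -> str:
    """Return a lookup key for an address (not for delivery)."""
    # Split on the LAST "@" so the domain is always the final part.
    local, sep, domain = address.strip().rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")

    local, domain = local.lower(), domain.lower()

    # Drop a "+suffix". Assumption: the provider treats it as a tag.
    base = local.split("+", 1)[0]
    if base:  # keep the local part as-is if it starts with "+"
        local = base

    # Periods are only ignored for Gmail-hosted domains.
    if domain in GMAIL_DOMAINS:
        local = local.replace(".", "")

    return f"{local}@{domain}"


print(canonicalize_email("John.Doe+news@Gmail.com"))
print(canonicalize_email("David+x@Example.com"))
```

Keeping the address exactly as entered alongside this key means mail can still go to the form the user typed.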

These techniques have been working fine so far in my app for my "Find My Friends" feature.




The problem with your canonicalization idea is that it doesn't work in all cases. Yes, it might work with Gmail addresses, but there's no way that you can assume that "david+x@example.com" and "david+y@example.com" are actually the same mailbox. If you do assume that, then you've just broken your website for users where that isn't the case.

That's one of the reasons why it's recommended that email services have case-insensitive local parts. That and the fact that there are plenty of clients out there that will corrupt the address capitalization.

Why would you broadly assume an incorrect rule? Email address parsers correctly implement canonicalization rules that consider the domain e.g. gmail. It doesn't require any extra work as a developer and the logic is hidden behind the abstraction. But certainly, you shouldn't go implementing arbitrary rules that aren't reasonably applicable.

That's why I use normalization for gmail addresses that my users use. It solves the problem completely.

By canonicalization I'm not saying that any arbitrary practice by Gmail or any other email provider should be considered standard. I haven't looked at the RFC in some time, but I don't believe the use of plus suffixes is standard either. Nonetheless, I believe plus suffixes are more commonplace, generally permitted, and serve a reasonable purpose. For instance, sending email to a user using their email address as provided is a good practice, because it preserves a plus suffix that may help the user organize their email.

At the same time, canonicalizing email addresses in a sensible way, e.g. stripping plus suffixes, can help prevent unintentional duplicate sign-ups. Just consider a sign-up form on the homepage of a website. It's not uncommon for people to enter their email and password into that form by mistake, thinking they're signing in. Additionally, if the website compares canonical email addresses when checking login credentials, then a user who signed up with an email address containing a plus suffix can sign in using their base email. These two situations combined could lead to the accidental creation of a duplicate account if canonical email addresses are not compared during registration.

There are some trade-offs with this strategy, but as long as the canonicalization is implemented reasonably, I see it as an aid to the user. Note that "reasonably" doesn't necessarily mean stripping dots.
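To make the duplicate sign-up argument concrete, here is a rough sketch of a store that compares canonical keys at registration but keeps the address as entered for delivery. `AccountStore`, `canonical_key`, and `delivery_address` are hypothetical names, and the canonical form used here (lowercase plus stripping a plus suffix, no dot stripping) is just one possible choice.

```python
from typing import Dict, Optional


def canonical_key(address: str) -> str:
    """Lowercase the address and drop a "+suffix"; dots are left alone."""
    local, _, domain = address.strip().rpartition("@")
    base = local.split("+", 1)[0] or local
    return f"{base.lower()}@{domain.lower()}"


class AccountStore:
    """In-memory stand-in for a user table (illustration only)."""

    def __init__(self) -> None:
        # canonical key -> address exactly as the user entered it
        self._entered_by_key: Dict[str, str] = {}

    def register(self, entered: str) -> bool:
        """Return False if an account with the same canonical key exists."""
        key = canonical_key(entered)
        if key in self._entered_by_key:
            return False
        self._entered_by_key[key] = entered
        return True

    def delivery_address(self, any_form: str) -> Optional[str]:
        """Look up by canonical key, return the address as entered."""
        return self._entered_by_key.get(canonical_key(any_form))


store = AccountStore()
store.register("Ann+shop@example.com")         # first sign-up succeeds
print(store.register("ann@example.com"))       # same canonical key, rejected
print(store.delivery_address("ann@example.com"))  # plus suffix is preserved
```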

As for the second point, I consider it a privacy breach if a service publicly associates my email address with their service without my consent. Sign-up forms do this when giving different responses when an email address is registered vs not registered.

As for how to handle it, if a user signs up with a new email address, you send them an email to verify their email address and instruct them to check their email. Similarly, if a user attempts to sign up with an already registered email address, you send them an email letting them know they already have an account and instruct them to check their email, which will provide them with a link to log in.

In the latter case, if they enter the correct password, you can just directly tell the user they already have an account, as they've proven their identity.
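The flow above could look roughly like this. Everything here is a sketch under assumptions: `handle_signup`, the `accounts` mapping, the `verify` callable, and the `outbox` list are hypothetical stand-ins for a real user table, password checker, and mail queue. The point is only that the page response is identical for registered and unregistered addresses unless the caller proves their identity with the correct password.

```python
from typing import Callable, Dict, List, Tuple

CHECK_INBOX = "Check your email to continue."
ALREADY_REGISTERED = "You already have an account."


def handle_signup(
    email: str,
    password: str,
    accounts: Dict[str, str],            # email -> stored credential
    verify: Callable[[str, str], bool],  # (password, stored) -> match?
    outbox: List[Tuple[str, str]],       # (recipient, template) to send
) -> str:
    stored = accounts.get(email)
    if stored is None:
        # New address: send a verification mail. Creating the pending
        # account is left out of this sketch.
        outbox.append((email, "verify-address"))
        return CHECK_INBOX
    if verify(password, stored):
        # Correct password: the caller has proven their identity,
        # so it is safe to say the account exists.
        return ALREADY_REGISTERED
    # Registered address, wrong password: same on-page response as for
    # a new address; the details only go to the mailbox owner.
    outbox.append((email, "already-registered"))
    return CHECK_INBOX
```

A real implementation would also need to keep response timing similar across the branches, which this sketch does not attempt.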


Thank you for responding.

I also never said that you are "normalizing away" the +bar thing. I said that you are building in a non-standard convention for parsing the semantics of an email address, one which (mostly) holds true for one particular provider (gmail.com), without acknowledging at all that it is Gmail-specific. Even if you are aware of it, your readers might not be.

In short, I do agree that your examples are silly, but I disagree that you demonstrate how they can do useful things: you mostly show how you can do silly things with them instead.

Note that you are suggesting this as a pattern: even if your users are accustomed to their email addresses being normalized, they might not be accustomed to other things being so. If one takes your advice, you'd have FullName, FirstName, and LastName types as well, each doing normalization based on the language and using e.g. a Unicode normalization form.

I do agree that you should do the things that have value, but I don't see you showing that value anywhere (your example has a bunch of problems that I highlighted, but what are the benefits?).

Note that I've been there as well early on in my career: it feels so satisfying to imagine a world where our data types smartly represent exactly the object they need to, with all the properties and actions one can do with it. However, it quickly breaks down in the real world, and you end up with IncomingEmailAddress, SMTPEnvelopeEmailAddress, UserProvidedEmailAddress, etc. instead.

This is not to say that a dedicated type never makes sense: you may have a specific application that does some smart handling of an email address (like having aliasing rules for different popular providers), e.g. a mailing list app. But if that's not your business domain, a string is usually a better choice for what is essentially a string.

Let me give you another simple yet convoluted example that usually breaks quickly: compound types (even your EmailAddress would be used in different contexts). For simplicity, let's go to my example of Name: (FirstName, LastName). For display purposes, you might want to show Name as "<FirstName> <LastName>" on a profile page (except for East Asian names), but collate according to LastName. Now your Name needs to understand the context it's in, and what you hoped would be a trivial semantic structure is anything but.
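To illustrate how context leaks in, here is a small sketch. The `Name` dataclass, the `family_first` flag, and the `display` and `sort_key` functions are all hypothetical; a real application would need locale-aware collation and far more naming conventions than a single boolean can express.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Name:
    first: str
    last: str
    # Assumption: one flag is enough to model "family name comes first".
    family_first: bool = False


def display(name: Name) -> str:
    """Profile-page rendering: order depends on the naming convention."""
    if name.family_first:
        return f"{name.last} {name.first}"
    return f"{name.first} {name.last}"


def sort_key(name: Name):
    """Collation context: always order by last name, then first name."""
    return (name.last.casefold(), name.first.casefold())


people = [Name("Grace", "Hopper"), Name("Akira", "Kurosawa", family_first=True)]
for person in sorted(people, key=sort_key):
    print(display(person))
```

Even in this toy version, the display and collation rules live outside the type, which is the point being made: the "trivial" structure already depends on where it is used.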

All of this is to say: keep your data structures as simple as possible for as long as you can. Your future self will thank you.


The "algorithm" in this case is very simple. Periods are stripped entirely.

m.y.n.a.m.e@gmail.com and myname@gmail.com and my.name@gmail.com all canonicalize to the same account.
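As a quick sketch of that rule, scoped to Gmail only (the function name `strip_gmail_dots` and the `GMAIL_DOMAINS` set are hypothetical, and addresses at other domains are deliberately returned unchanged):

```python
# Assumption: only these domains get the dot-insensitive treatment.
GMAIL_DOMAINS = frozenset({"gmail.com", "googlemail.com"})


def strip_gmail_dots(address: str) -> str:
    """Remove periods from the local part, but only for Gmail domains."""
    local, _, domain = address.rpartition("@")
    if domain.lower() not in GMAIL_DOMAINS:
        return address  # other providers: periods may be significant
    return f"{local.replace('.', '')}@{domain}"


variants = ["m.y.n.a.m.e@gmail.com", "myname@gmail.com", "my.name@gmail.com"]
print({strip_gmail_dots(v) for v in variants})
```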


It’s even up to the email provider to make the local part completely or partially case-sensitive if they wish, but tell that to most government websites in the world. Probably a lost cause at this point.

Periods are more of a Gmail-specific thing, but as you mention, merging two people once in a blue moon is not a terrible price to pay for an analytics system. It’s not like it’s going to send mail to these addresses.


> Whether or not dots or +asdf is considered okay, an email address used for identification needs to be canonicalized in order to avoid duplicate sign-ups.

Since it isn't a standard or norm, how would that work? These are gmail exclusive features, and other services have other unique features.


Or they could just strip off the part after the "+", especially with Gmail addresses.

Unfortunately convention has normalised user expectations that email addresses are now case-insensitive. It's now a standard business requirement. Too bad few devs fully handle Unicode.

huh, TIL email addresses can be case sensitive.

You can do something similar now with email aliasing services. Also they don’t include your primary address which is easily stripped from the + format you are describing.

or just force all lowercase and keep the support burden lower and save yourself a ton of trouble when dealing with foreign systems. Add a rule on incoming e-mail to convert all addresses for the local domain to lowercase to complete the package.

As for handling this in the real world, the approach I like best is to store email addresses the way they are entered, but create an index on the lowercase version for lookup and uniqueness checks.
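Here is one way that approach might look with SQLite through Python's standard `sqlite3` module. The table, index, and helper names are made up for the example, expression indexes require SQLite 3.9 or newer, and SQLite's built-in `lower()` only folds ASCII letters by default, so non-ASCII addresses would need separate handling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
# Uniqueness is enforced on the lowercased form; the stored value keeps
# whatever capitalization the user typed.
conn.execute("CREATE UNIQUE INDEX users_email_lower ON users (lower(email))")


def add_user(db: sqlite3.Connection, email: str) -> bool:
    """Insert the address as entered; False if a case-variant exists."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute("INSERT INTO users (email) VALUES (?)", (email,))
        return True
    except sqlite3.IntegrityError:
        return False


def find_user(db: sqlite3.Connection, email: str):
    """Case-insensitive lookup that returns the address as stored."""
    row = db.execute(
        "SELECT email FROM users WHERE lower(email) = lower(?)", (email,)
    ).fetchone()
    return row[0] if row else None


add_user(conn, "Alice@Example.com")
print(add_user(conn, "alice@example.com"))   # rejected by the unique index
print(find_user(conn, "ALICE@example.com"))  # original capitalization kept
```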

Ah, interesting. I guess the case insensitivity (for incoming email) is a decision of the popular services then, like Gmail's decision to consider johndoe equivalent to john.doe.

The email RFCs explicitly say thou shalt not interpret the localpart of an email address, unless thou art the MTA of the domain in question. Even case folding is forbidden. And the wisdom of people who work with email is... the RFCs have good advice here: don't assume anything about how the localpart is structured.

You can generally get away with treating the names as case-preserving (as distinct from case-insensitive), and you are probably safe in rejecting quoted localparts. But anything beyond that, even forcibly lowercasing email addresses, is likely to cause problems.
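A conservative sketch along those lines: keep the localpart exactly as entered, reject quoted localparts, and only lowercase the domain (domain names are case-insensitive). The function name `accept_address` is hypothetical, and this is not a full address validator.

```python
def accept_address(address: str) -> str:
    """Return the address with the localpart untouched, domain lowercased."""
    # Split on the last "@"; a quoted localpart may itself contain "@".
    local, sep, domain = address.strip().rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    if local.startswith('"'):
        raise ValueError("quoted localparts are not supported")
    # No case folding, dot stripping, or "+" handling on the localpart.
    return f"{local}@{domain.lower()}"


print(accept_address("John.Doe+tag@Example.COM"))
```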


I would note that while Gmail is the only one to ignore periods in the local part, many services ignore capitalization and face the same issue.

Many services do strip capitalization when checking email address uniqueness, but this is as much a mistake as stripping dots.


Google (and most email providers) also treat the user portion as case-insensitive.