Modify

Opened 9 years ago

Closed 7 years ago

#12058 closed defect (fixed)

Bad email default regexp

Reported by: Dirk Stöcker Owned by: Steffen Hoffmann
Priority: normal Component: AccountManagerPlugin
Severity: normal Keywords: registration antirobot
Cc: Trac Release: 0.11

Description

Default email regexp is:

(?i)^[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}$

it should be

(?i)^[A-Z0-9._%+-]+@(?:[A-Z0-9.-]+\.)+[A-Z]{2,6}$

or otherwise subdomains are not possible.

This still prevents Umlauts like äöü and so on. You can reach me e.g. with dstöcker.de domain :-)

Attachments (0)

Change History (15)

comment:1 Changed 9 years ago by Jun Omae

It seems the former of patterns matches subdomains....

>>> import re
>>> pattern = re.compile(r'(?i)^[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}$')
>>> pattern.match('john@example.com')
<_sre.SRE_Match object at 0x00AEA720>
>>> pattern.match('john@sub-domain.example.com')
<_sre.SRE_Match object at 0x00AEA758>
>>> pattern.match('john@subsub.sub-domain.example.com')
<_sre.SRE_Match object at 0x00AEA7C8>
>>> pattern.match('john@nth.sub.subsub.sub-domain.example.com')
<_sre.SRE_Match object at 0x00AEA7C8>

Another considering, seems that max length of TLDs is larger than 6, and digits and - characters are used in internationalized TLD. According to domain name syntax, each label of domain name is 1 to 63 octets.

-        r'(?i)^[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}$',
+        r'(?i)^[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z0-9-]{2,63}$',
Last edited 9 years ago by Jun Omae (previous) (diff)

comment:2 Changed 9 years ago by Dirk Stöcker

Regarding subdomain: a space at the end of the mail prevented matching, not the subdomain. Sorry for that. Still jun66j5's comment stays valid and other valid UTF-8 characters.

comment:3 Changed 9 years ago by Steffen Hoffmann

dstöcker.de

This is way off-standard, isn't it? Even if current standards allow it without question, where does this discussion end in UTF-8 land? Allow only German an Japanese chars, but no Russian ones? You're certainly able to do that. However I think, that an admin should open up such degrees of freedom only after very careful consideration.

Speaking of the subject, SPAM prevention, there are likely more UTF-8 spam email addresses than legitimate ones, not even to speak of the total of real users, that only own an email address with UTF-8 chars.

Anyway, Jun's change proposal seems valid, and I'll stick to it, if there is not much magic coming from UTF-8 wizards in time.

comment:4 in reply to:  3 ; Changed 9 years ago by Dirk Stöcker

Replying to hasienda:

dstöcker.de

This is way off-standard, isn't it? Even if current standards allow it without question, where does this discussion end in UTF-8 land? Allow only German an Japanese chars, but no Russian ones? You're certainly able to do that. However I think, that an admin should open up such degrees of freedom only after very careful consideration.

Well, world moved on. Still it is a bad hack, as the domain is converted with Punycode to ASCII first ("xn--dstcker-c1a.de") before using. No full UTF-8 support for domains yet and probably never coming. And there are different versions of the rulesets how to do that.

Because of these troubles I did not yet really use my umlaut domains at all for serious work and also never have seen any real use elsewhere. :-) But I wanted to mention it.

comment:5 in reply to:  4 Changed 9 years ago by Steffen Hoffmann

Replying to stoecker:

Replying to hasienda:

dstöcker.de

This is way off-standard, isn't it? ...

Well, world moved on.

... But I wanted to mention it.

Sure. In general I care a lot for localization, and your comments have helped me gaining yet a little more knowledge on current status. Thanks.

comment:6 Changed 9 years ago by Steffen Hoffmann

Keywords: registration antirobot added
Status: newaccepted
Summary: Bad emaill regexpBad email default regexp
Trac Release: 0.11

comment:7 Changed 9 years ago by Steffen Hoffmann

In 14257:

AccountManagerPlugin: Improve default email regexp pattern, refs #12058.

Thanks to Dirk and Jun for the report and fruitful discussion on the issue.

comment:8 Changed 9 years ago by Ryan J Ollos

Would the changes from this ticket be useful for updating EMAIL_LOOKALIKE_PATTERN in Trac? That pattern is used for finding email addresses in wiki content and extracting email address from a list of recipients.

Last edited 9 years ago by Ryan J Ollos (previous) (diff)

comment:9 Changed 9 years ago by Steffen Hoffmann

Maybe I did initially start my development from that pattern, and I searched for such similar occurrences but failed to find them.

Keep in mind that not even admins can change such built-ins easily (without cumbersome source patching). So the status-quo reached and used here might just be enough to go there, extend and enhance it to omake it fully standards-compliant, with UTF-8 that time.

comment:10 in reply to:  9 Changed 9 years ago by Ryan J Ollos

Replying to hasienda:

Keep in mind that not even admins can change such built-ins easily

I hope it was clear that I meant to propose backporting to Trac ...

(without cumbersome source patching). So the status-quo reached and used here might just be enough to go there, extend and enhance it to omake it fully standards-compliant, with UTF-8 that time.

... which I take from this statement that you understood my intention. I'll wait to see if Jun has a comment about the usefulness of backporting your regex to replace EMAIL_LOOKALIKE_PATTERN in the core, before opening a ticket at t.e.o.

comment:11 in reply to:  8 ; Changed 9 years ago by Jun Omae

Replying to rjollos:

Would the changes from this ticket be useful for updating EMAIL_LOOKALIKE_PATTERN in Trac? That pattern is used for finding email addresses in wiki content and extracting email address from a list of recipients.

I don't think that it's need to backport the updates to Trac core. EMAIL_LOOKALIKE_PATTERN currently matches TLDs with more than 6 octets and internationalized TLDs.

$ ~/venv/trac/1.0.2/bin/python
Python 2.5.6 (r256:88840, Oct 21 2014, 22:26:35)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> from trac.notification import EMAIL_LOOKALIKE_PATTERN
>>> re.match(EMAIL_LOOKALIKE_PATTERN, u'foo@example.community', re.UNICODE)
<_sre.SRE_Match object at 0x7f93eef039f0>
>>> re.match(EMAIL_LOOKALIKE_PATTERN, u'foo@example.xn--zckzah', re.UNICODE)
<_sre.SRE_Match object at 0x1a58ed0>

However, I noticed another issue in EMAIL_LOOKALIKE_PATTERN. The pattern is used \d with re.UNICODE. If a TLD has unicode digit characters, it wrongly matches. See http://trac.edgewall.org/demo-1.0/ticket/3692. That is a Trac issue and similar to t:#10678.

>>> re.match(EMAIL_LOOKALIKE_PATTERN, u'foo@example.c௩௪௫om', re.UNICODE)
<_sre.SRE_Match object at 0x1a58f38>

Trac core should use [0-9] rather than \d in EMAIL_LOOKALIKE_PATTERN.

-        '[a-zA-Z](?:[-a-zA-Z\d]*[a-zA-Z\d])?' # TLD
+        '[a-zA-Z](?:[-a-zA-Z0-9]*[a-zA-Z0-9])?' # TLD
Last edited 9 years ago by Jun Omae (previous) (diff)

comment:12 in reply to:  11 Changed 9 years ago by Steffen Hoffmann

Replying to jun66j5:

I don't think that it's need to backport the updates to Trac core.

Your explanation is convincing.

comment:13 Changed 9 years ago by Ryan J Ollos

I shouldn't have said changes in this ticket. I was thinking more broadly than just the TLD in considering differences in EMAIL_LOOKALIKE_PATTERN.

  • EMAIL_LOOKALIKE_PATTERN allows ' in the local part.
    >>> import re
    >>> from trac.notification import EMAIL_LOOKALIKE_PATTERN
    >>> email_regex = r'(?i)^[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z0-9-]{2,63}$'
    >>> re.match(email_regex, 'user%5@gmail.com')
    >>> re.match(EMAIL_LOOKALIKE_PATTERN, 'user%5@gmail.com')
    <_sre.SRE_Match object at 0x7f84a1313f38>
    
  • email_regexp used here % in local part
    >>> re.match(EMAIL_LOOKALIKE_PATTERN, 'user%5@gmail.com')
    >>> import re
    >>> from trac.notification import EMAIL_LOOKALIKE_PATTERN
    >>> email_regex = r'(?i)^[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z0-9-]{2,63}$'
    >>> re.match(email_regex, 'user\'@gmail.com')
    >>> re.match(EMAIL_LOOKALIKE_PATTERN, 'user\'@gmail.com')
    <_sre.SRE_Match object at 0x7f84a1313f38>
    
    It appear even more special characters are allowed in the local part.
  • email_regexp is simpler and more readable than EMAIL_LOOKALIKE_PATTERN IMO.

EMAIL_LOOKALIKE_PATTERN in Trac should be robust enough that it wouldn't need to be redefined in AccountManagerPlugin when used for the same general purposes. It appears that neither EMAIL_LOOKALIKE_PATTERN or email_regex are currently sufficient to match all valid email addresses according to RFC 6531.

The rules for valid syntax appear complex though. It seems like there should be some utility in the email or other package to handle this for us.

Last edited 9 years ago by Ryan J Ollos (previous) (diff)

comment:14 Changed 9 years ago by Steffen Hoffmann

You're asking for a single pattern to be used inside Trac, and ultimately in Trac plugins too. Reasonable. You're targeting Trac 1.0.3 for this change? Make sure to put a note down here when we could try to an import like

from trac.util.text import EMAIL_PATTERN

I don't think, a pre-compiled pattern ('EMAIL*_RE') is required for most applications.

comment:15 Changed 7 years ago by Ryan J Ollos

Resolution: fixed
Status: acceptedclosed

Modify Ticket

Change Properties
Set your email in Preferences
Action
as closed The owner will remain Steffen Hoffmann.
The resolution will be deleted. Next status will be 'reopened'.

Add Comment


E-mail address and name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.