-
-
Notifications
You must be signed in to change notification settings - Fork 34.1k
gh-144156: Fix email header folding concatenating encoded words #144692
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
|
FYI I've been looking at this and will have a review soonish. You are correct that the gh-92281 fix was buggy. |
bitdancer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this!
I edited this review a bunch, so mistake could have crept in to my code suggestions, though I did test them. The randomizer test should catch any transcription errors if nothing else does.
| break | ||
| if lines_popped: | ||
| lines.append(' ' if lines else '') | ||
| return ''.join(wsp_lines) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is necessary, see my suggestion below.
| # words, we may need to encode the whitespace | ||
| # at the beginning of the second word. | ||
| last_word_is_ew = False | ||
| last_ew = None # Points to the last encoded character if there's an ew on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment that was added by 90281 is incorrect. It could say "if there is an encoded word in the last line of lines, points to the encoded word's first character". Which means it is a badly named variable (that's my fault). We could rename it 'last_ew_start', but I don't know if we want to do that in this PR. The comment should be fixed, though.
| # This whitespace has been added to the lines in _fold_as_ew() | ||
| # so clear it now. | ||
| leading_whitespace = '' | ||
| part.ew_combine_allowed, charset, last_word_is_ew) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just reformat this:
last_ew = _fold_as_ew(
tstr,
lines,
maxlen,
last_ew,
part.ew_combine_allowed,
charset,
last_word_is_ew,
)
| if len(tstr) <= maxlen - len(lines[-1]): | ||
| lines[-1] += tstr | ||
| if any(char not in WSP for char in tstr): | ||
| last_word_is_ew = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| last_word_is_ew = False | |
| last_word_is_ew = not tstr.strip(_WSP) and last_word_is_ew |
You'll need to add
_WSP = ''.join(WSP)
up at the top of the file where WSP is defined. (I'm actually adding that constant myself in a different PR I'm working on).
I haven't run timeit on it, but this should be more efficient as well as more compact. Hmm. I suppose it would be faster to have last_word_is_ew first, but then we should maybe wrap the tstr reference in bool for cleanliness. Up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much faster indeed.
import timeit
_wSP = ' \t'
WSP = set(_wSP)
def one(s):
return bool(s.strip(_wSP))
def two(s):
return all(c in WSP for c in s)
def three(s):
return not any(c not in WSP for c in s)
s1 = ' ' * 80
s2 = ' ' * 40 + 'a' + ' ' * 40
s3 = ' ' + 'a' * 80 + ' '
print('s1')
print(timeit.timeit(lambda: one(s1), number=10_000))
print(timeit.timeit(lambda: two(s1), number=10_000))
print(timeit.timeit(lambda: three(s1), number=10_000))
print('s2')
print(timeit.timeit(lambda: one(s2), number=10_000))
print(timeit.timeit(lambda: two(s2), number=10_000))
print(timeit.timeit(lambda: three(s2), number=10_000))
print('s3')
print(timeit.timeit(lambda: one(s3), number=10_000))
print(timeit.timeit(lambda: two(s3), number=10_000))
print(timeit.timeit(lambda: three(s3), number=10_000))s1
0.014692971999920701
0.22478941200006375
0.22499246799998218
s2
0.01503182199985531
0.12194120699996347
0.1325552719999905
s3
0.011907245000202238
0.025579678999747557
0.025823422000030405
| whitespace_accumulator.append(char) | ||
| leading_whitespace = ''.join(whitespace_accumulator) | ||
| if any(char not in WSP for char in lines[-1]): | ||
| last_word_is_ew = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| last_word_is_ew = False | |
| last_word_is_ew = not tstr.strip(_WSP) and last_word_is_ew |
| if any(char not in WSP for char in tstr): | ||
| last_word_is_ew = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if any(char not in WSP for char in tstr): | |
| last_word_is_ew = False | |
| last_word_is_ew = not tstr.strip(_WSP) and last_word_is_ew |
| # Therefore, we encode all to-be-displayed whitespace in the second | ||
| # encoded word. | ||
| leading_whitespace = _steal_all_trailing_WSP_if_exists(lines) | ||
| to_encode = leading_whitespace + to_encode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| to_encode = leading_whitespace + to_encode | |
| len_without_wsp = len(lines[-1].rstrip(_WSP)) | |
| leading_whitespace = lines[-1][len_without_wsp:] | |
| lines[-1] = lines[-1][:len_without_wsp] + ( | |
| ' ' if leading_whitespace else '') | |
| to_encode = leading_whitespace + to_encode |
This avoids that complicated single use function. We should never be producing lines that contain only blanks, they get encoded.
You can also move this up to be the first elif and drop the 'and not last_word_is_ew'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should never be producing lines that contain only blanks, they get encoded.
While testing, I noticed, too, that wrapped spaces are always encoded. TBH, I didn't like that and wanted to come up with a solution that's robust to changing this behavior in the future. I didn't like it, because reading the RFC and the rest of this code it sure seems like not encoding words is generally preferable because of its human readability (and to a lesser extent mail clients that don't support encoded words). Further, I thought assuming an invariant that's enforced a few hundred lines away in a different function is a nice recipe for spooky action at a distance.
If you anyway think the fix should assume this invariant, I'll change this part.
| text_space = remaining_space - chrome_len | ||
| if text_space <= 0: | ||
| lines.append(' ') | ||
| continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of my new tests revealed a pre-existing problem here:
| continue | |
| if text_space <= 0: | |
| newline = _steal_trailing_WSP_if_exists(lines) | |
| lines.append(newline or ' ') | |
| new_last_ew = len(lines[-1]) | |
| continue |
| g = BytesGenerator(s) | ||
| g.flatten(msg) | ||
| self.assertEqual(s.getvalue(), expected) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a test harness I used to verify the code:
def test_ew_folding_round_trip_random(self):
from random import randint
c = 30
b = 30
w = 30
source = ''
#source += ' ' * randint(0, b)
source += ('a' * randint(1, c), 'ф' * randint(1, c))[randint(0, 1)]
for i in range(randint(1, w)):
source += ' ' * randint(1, b)
source += ('a' * randint(1, c), 'ф' * randint(1, c))[randint(0, 1)]
source += ' ' * randint(0, b)
with open('temp', 'a') as f:
print(source, file=f)
msg = EmailMessage()
msg['Subject'] = source
s = io.BytesIO()
g = BytesGenerator(s, maxheaderlen=30)
g.flatten(msg)
flat = s.getvalue()
reparsed = message_from_bytes(flat, policy=policy.default)['Subject']
self.assertMultiLineEqual(reparsed, source)
I'm not sure if we should include a variation on this in the PR or not, but you might want to run it a bunch, maybe tweaking the constants, to see if I missed anything. I ran it for a while with the above parameters and saw no failures.
Here are two tests that the above produced, that located the pre-existing bug I mentioned above:
def test_ew_folding_round_trip_1(self):
print()
source = "aaaaaaaaa фффффффф "
msg = EmailMessage()
msg['Subject'] = source
s = io.BytesIO()
g = BytesGenerator(s, maxheaderlen=30)
g.flatten(msg)
flat = s.getvalue()
reparsed = message_from_bytes(flat, policy=policy.default)['Subject']
self.assertMultiLineEqual(reparsed, source)
def test_ew_folding_round_trip_2(self):
print()
source = "aaa aaaaaaa aaa ффф фффф "
msg = EmailMessage()
msg['Subject'] = source
s = io.BytesIO()
g = BytesGenerator(s, maxheaderlen=30)
g.flatten(msg)
flat = s.getvalue()
reparsed = message_from_bytes(flat, policy=policy.default)['Subject']
self.assertMultiLineEqual(reparsed, source)
| @@ -0,0 +1,3 @@ | |||
| Fix folding of email headers violating `RFC 2047`_ with two consecutive encoded words without separating linear-white-space. | |||
|
|
|||
| .. _RFC 2047: https://www.rfc-editor.org/rfc/rfc2047 | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| .. _RFC 2047: https://www.rfc-editor.org/rfc/rfc2047 | |
| Fix the folding of headers by the :mod:`email` library when :rfc:`2047` encoded words are used. Now whitespace is correctly preserved and also correctly added between adjacent encoded words. The latter property was broken by the fix for gh-92081, which mostly fixed previous failures to preserve whitespace. |
Out of curiosity I checked, and according to the changelogs the buggy fix was released in 3.14 alpha 1, 3.13 beta 2 and 3.12.5. Unfortunately the latter is now final. Maybe we should include that in the news item? I'm not sure what the policy is on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should include that in the news item?
I'm all for documenting known issues and shortcomings. However, I doubt people would find this info in the release notes unless they're doing "software archeology", i.e., they already know (roughly) what info they're looking for.
I'm not sure what the policy is on that.
Neither am I. I'm pretty much a first-time contributor 😅
|
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
Co-authored-by: R. David Murray <[email protected]>
Uh oh!
There was an error while loading. Please reload this page.