Skip to content

Conversation

@robsdedude
Copy link

@robsdedude robsdedude commented Feb 10, 2026

@bedevere-app
Copy link

bedevere-app bot commented Feb 10, 2026

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@robsdedude robsdedude changed the title gh-144156: Fix email header wrapping omitting white space gh-144156: Fix email header folding concatenating encoded words Feb 10, 2026
@robsdedude robsdedude marked this pull request as ready for review February 11, 2026 13:32
@robsdedude robsdedude requested a review from a team as a code owner February 11, 2026 13:32
@bitdancer
Copy link
Member

FYI I've been looking at this and will have a review soonish. You are correct that the gh-92281 fix was buggy.

Copy link
Member

@bitdancer bitdancer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this!

I edited this review a bunch, so mistake could have crept in to my code suggestions, though I did test them. The randomizer test should catch any transcription errors if nothing else does.

break
if lines_popped:
lines.append(' ' if lines else '')
return ''.join(wsp_lines)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is necessary, see my suggestion below.

# words, we may need to encode the whitespace
# at the beginning of the second word.
last_word_is_ew = False
last_ew = None # Points to the last encoded character if there's an ew on
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment that was added by 90281 is incorrect. It could say "if there is an encoded word in the last line of lines, points to the encoded word's first character". Which means it is a badly named variable (that's my fault). We could rename it 'last_ew_start', but I don't know if we want to do that in this PR. The comment should be fixed, though.

# This whitespace has been added to the lines in _fold_as_ew()
# so clear it now.
leading_whitespace = ''
part.ew_combine_allowed, charset, last_word_is_ew)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just reformat this:

last_ew = _fold_as_ew(
    tstr,
    lines,
    maxlen,
    last_ew,
    part.ew_combine_allowed,
    charset,
    last_word_is_ew,
    )

if len(tstr) <= maxlen - len(lines[-1]):
lines[-1] += tstr
if any(char not in WSP for char in tstr):
last_word_is_ew = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
last_word_is_ew = False
last_word_is_ew = not tstr.strip(_WSP) and last_word_is_ew

You'll need to add

_WSP = ''.join(WSP)

up at the top of the file where WSP is defined. (I'm actually adding that constant myself in a different PR I'm working on).

I haven't run timeit on it, but this should be more efficient as well as more compact. Hmm. I suppose it would be faster to have last_word_is_ew first, but then we should maybe wrap the tstr reference in bool for cleanliness. Up to you.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much faster indeed.

import timeit

_wSP = ' \t'
WSP = set(_wSP)

def one(s):
    return bool(s.strip(_wSP))

def two(s):
    return all(c in WSP for c in s)

def three(s):
    return not any(c not in WSP for c in s)

s1 = ' ' * 80
s2 = ' ' * 40 + 'a' + ' ' * 40
s3 = ' ' + 'a' * 80 + ' '

print('s1')
print(timeit.timeit(lambda: one(s1), number=10_000))
print(timeit.timeit(lambda: two(s1), number=10_000))
print(timeit.timeit(lambda: three(s1), number=10_000))

print('s2')
print(timeit.timeit(lambda: one(s2), number=10_000))
print(timeit.timeit(lambda: two(s2), number=10_000))
print(timeit.timeit(lambda: three(s2), number=10_000))

print('s3')
print(timeit.timeit(lambda: one(s3), number=10_000))
print(timeit.timeit(lambda: two(s3), number=10_000))
print(timeit.timeit(lambda: three(s3), number=10_000))
s1
0.014692971999920701
0.22478941200006375
0.22499246799998218
s2
0.01503182199985531
0.12194120699996347
0.1325552719999905
s3
0.011907245000202238
0.025579678999747557
0.025823422000030405

whitespace_accumulator.append(char)
leading_whitespace = ''.join(whitespace_accumulator)
if any(char not in WSP for char in lines[-1]):
last_word_is_ew = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
last_word_is_ew = False
last_word_is_ew = not tstr.strip(_WSP) and last_word_is_ew

Comment on lines +3010 to +3011
if any(char not in WSP for char in tstr):
last_word_is_ew = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if any(char not in WSP for char in tstr):
last_word_is_ew = False
last_word_is_ew = not tstr.strip(_WSP) and last_word_is_ew

# Therefore, we encode all to-be-displayed whitespace in the second
# encoded word.
leading_whitespace = _steal_all_trailing_WSP_if_exists(lines)
to_encode = leading_whitespace + to_encode
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
to_encode = leading_whitespace + to_encode
len_without_wsp = len(lines[-1].rstrip(_WSP))
leading_whitespace = lines[-1][len_without_wsp:]
lines[-1] = lines[-1][:len_without_wsp] + (
' ' if leading_whitespace else '')
to_encode = leading_whitespace + to_encode

This avoids that complicated single use function. We should never be producing lines that contain only blanks, they get encoded.

You can also move this up to be the first elif and drop the 'and not last_word_is_ew'.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should never be producing lines that contain only blanks, they get encoded.

While testing, I noticed, too, that wrapped spaces are always encoded. TBH, I didn't like that and wanted to come up with a solution that's robust to changing this behavior in the future. I didn't like it, because reading the RFC and the rest of this code it sure seems like not encoding words is generally preferable because of its human readability (and to a lesser extent mail clients that don't support encoded words). Further, I thought assuming an invariant that's enforced a few hundred lines away in a different function is a nice recipe for spooky action at a distance.

If you anyway think the fix should assume this invariant, I'll change this part.

text_space = remaining_space - chrome_len
if text_space <= 0:
lines.append(' ')
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of my new tests revealed a pre-existing problem here:

Suggested change
continue
if text_space <= 0:
newline = _steal_trailing_WSP_if_exists(lines)
lines.append(newline or ' ')
new_last_ew = len(lines[-1])
continue

g = BytesGenerator(s)
g.flatten(msg)
self.assertEqual(s.getvalue(), expected)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a test harness I used to verify the code:

    def test_ew_folding_round_trip_random(self):
        from random import randint
        c = 30
        b = 30
        w = 30
        source = ''
        #source += ' ' * randint(0, b)
        source += ('a' * randint(1, c), 'ф' * randint(1, c))[randint(0, 1)]
        for i in range(randint(1, w)):
            source += ' ' * randint(1, b)
            source += ('a' * randint(1, c), 'ф' * randint(1, c))[randint(0, 1)]
        source += ' ' * randint(0, b)
        with open('temp', 'a') as f:
            print(source, file=f)
        msg = EmailMessage()
        msg['Subject'] = source
        s = io.BytesIO()
        g = BytesGenerator(s, maxheaderlen=30)
        g.flatten(msg)
        flat = s.getvalue()
        reparsed = message_from_bytes(flat, policy=policy.default)['Subject']
        self.assertMultiLineEqual(reparsed, source)

I'm not sure if we should include a variation on this in the PR or not, but you might want to run it a bunch, maybe tweaking the constants, to see if I missed anything. I ran it for a while with the above parameters and saw no failures.

Here are two tests that the above produced, that located the pre-existing bug I mentioned above:

    def test_ew_folding_round_trip_1(self):
        print()
        source = "aaaaaaaaa фффффффф "
        msg = EmailMessage()
        msg['Subject'] = source
        s = io.BytesIO()
        g = BytesGenerator(s, maxheaderlen=30)
        g.flatten(msg)
        flat = s.getvalue()
        reparsed = message_from_bytes(flat, policy=policy.default)['Subject']
        self.assertMultiLineEqual(reparsed, source)

    def test_ew_folding_round_trip_2(self):
        print()
        source = "aaa aaaaaaa   aaa ффф фффф  "
        msg = EmailMessage()
        msg['Subject'] = source
        s = io.BytesIO()
        g = BytesGenerator(s, maxheaderlen=30)
        g.flatten(msg)
        flat = s.getvalue()
        reparsed = message_from_bytes(flat, policy=policy.default)['Subject']
        self.assertMultiLineEqual(reparsed, source)

@@ -0,0 +1,3 @@
Fix folding of email headers violating `RFC 2047`_ with two consecutive encoded words without separating linear-white-space.

.. _RFC 2047: https://www.rfc-editor.org/rfc/rfc2047
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.. _RFC 2047: https://www.rfc-editor.org/rfc/rfc2047
Fix the folding of headers by the :mod:`email` library when :rfc:`2047` encoded words are used. Now whitespace is correctly preserved and also correctly added between adjacent encoded words. The latter property was broken by the fix for gh-92081, which mostly fixed previous failures to preserve whitespace.

Out of curiosity I checked, and according to the changelogs the buggy fix was released in 3.14 alpha 1, 3.13 beta 2 and 3.12.5. Unfortunately the latter is now final. Maybe we should include that in the news item? I'm not sure what the policy is on that.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should include that in the news item?

I'm all for documenting known issues and shortcomings. However, I doubt people would find this info in the release notes unless they're doing "software archeology", i.e., they already know (roughly) what info they're looking for.

I'm not sure what the policy is on that.

Neither am I. I'm pretty much a first-time contributor 😅

@bedevere-app
Copy link

bedevere-app bot commented Feb 12, 2026

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

Co-authored-by: R. David Murray <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants