Skip to content

Fix Sphinx copyright year substitution #312

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 17, 2025

Conversation

Image for: Conversation
Copy link
Contributor

cjwatson commented May 5, 2025

Using a Unicode en dash to separate years in the Sphinx copyright option results in unreproducible builds (in the sense of https://reproducible-builds.org/), because Sphinx's special handling of SOURCE_DATE_EPOCH only works when the separator is an ASCII hyphen. The Sphinx developers seem to have considered changing this and then decided against it (see
sphinx-doc/sphinx#12738), so it seems best to adjust the configuration here.

  • [N/A] Closes # (insert issue number)
  • Executed pre-commit run --all-files with no errors
  • [N/A] The change is fully covered by automated unit tests
  • [N/A] Documented in docs/ as appropriate
  • Added an entry to the CHANGES file
Using a Unicode en dash to separate years in the Sphinx `copyright`
option results in unreproducible builds (in the sense of
https://reproducible-builds.org/), because Sphinx's special handling of
`SOURCE_DATE_EPOCH` only works when the separator is an ASCII hyphen.
The Sphinx developers seem to have considered changing this and then
decided against it (see
sphinx-doc/sphinx#12738), so it seems best to
adjust the configuration here.
Copy link
Member

newville commented May 5, 2025

@cjwatson I don't understand what problem this solves. Somebody made up some "best practice" about formatting copyright dates in documentation?

If someone has a workflow based on parsing the copyright notice of the documentation, they should be brought to an awareness about the fragility of that assumption. Git it really good at marking and time-stamping changes to a collection of files, and tagged releases and/or pushes to PyPI should be used to decide changes in releases and their dates.

If we are touching the copyright notice for the docs, then @jagerber48 and I should be added as authors, or perhaps just change that to "Uncertainties Developers".

I vote to close.

Copy link
Contributor Author

cjwatson commented May 5, 2025

Sorry for apparently not explaining this very clearly.

There's an effort called "reproducible builds" which is trying to improve the situation with users being able to verify that binary packages they download match the corresponding source code. A key part of that is being able to build the same source code at different times (though with the same toolchains, dependencies, and so on) and get exactly the same results.

https://tests.reproducible-builds.org/debian/rb-pkg/unstable/amd64/uncertainties.html currently shows that when we tried to build the Debian package of uncertainties with the system clock artificially set to different times, we found differences in the results which amounted to the following in various places:

- © Copyright 2010–2025, Eric O. LEBIGOT (EOL).
+ © Copyright 2010–2026, Eric O. LEBIGOT (EOL).

The current Sphinx configuration for uncertainties sets the copyright option to f"2010–{date.today().year}, Eric O. LEBIGOT (EOL)", so it automatically picks up whatever the current year is from the system clock and claims that as the end date of the copyright. That approach inherently makes builds unreproducible, because they will change if built in a different year. However, since this sort of thing is quite a widespread problem, some years ago the people driving the reproducible-builds project standardized a SOURCE_DATE_EPOCH environment variable that can be used to communicate information such as the last relevant date from Git history or the changelog date of a package version or whatever to a build process.

Sphinx has included support for SOURCE_DATE_EPOCH for many years, which is part of its logic for deciding when to adjust the copyright option to match the current year. As you can see from https://github.com/sphinx-doc/sphinx/blob/master/sphinx/config.py#L769, this logic is sensitive to the punctuation used in the copyright option.

Based on all the above, I'm waiting for confirmation from Debian's reproducible-builds testing machinery that adjusting the punctuation here fixes build reproducibility across year boundaries, but I think it should.

If you think it's wrong to pick up the end date from the system clock, then changing to a fixed end date rather than {date.today().year} in doc/conf.py would certainly be an option too. I didn't suggest that because I try to stick to minimal changes when I'm proposing this sort of fix to projects I'm not particularly involved with, but that approach should also fix the problem.

Of course I'm not in a position to express any view on the question of who should be listed in the copyright option.

Copy link
Member

newville commented May 5, 2025

@cjwatson Why would a "reproducible build" process which clearly states (100% I know what this thing is):

reproducible builds are a set of software development practices that create an independently-verifiable path from source to binary code.

be concerned about the copyright notice for the documentation?

I'm sort of still leaning towards "Nope".

Anyway, the whole "reproducible build" thing is inherently uninteresting for this project, as it is pure Python. There is no path from source code to binary -- there is nothing to verify.

Copy link
Contributor Author

cjwatson commented May 5, 2025

It is not in general sustainable or realistic to do reproducibility checks with a pile of exceptions for things that might be different, not least because small differences have a way of embedding themselves into nested file formats. We check reproducibility by confirming that the build outputs (which in this case includes the generated documentation) are bit-for-bit identical.

By the same token, it's not sensible to have exceptions for packages that ostensibly have trivial build procedures. Sure, the process of going from the source tree for uncertainties to a binary output might be trivial. That does not prevent somebody from injecting code into a build machine (whether that's for wheels, .debs, RPMs, or whatever) that includes some kind of stealthy modification in the build outputs. Reproducible builds give us a way to detect that kind of attack, even if it's only after the fact, by running independent builds and comparing the outputs. And doing that is a lot more effective when the outputs can actually be reliably compared, without having to maintain a complex set of known exclusions. (This is just one package among many.)

Copy link
Contributor

jagerber48 commented May 5, 2025

I'm a bit puzzled by this also but changing a single character imperceptibly seems low cost.. I just need to learn more about what the issue is. For example, if the issue is originating with sphinx then I would rather push hard on them than go around changing hyphens in 1000s of repositories.

@cjwatson how did this issue come to your attention for this package? Are you doing some scrape of many popular pypi packages? Or did this actually break a workflow somewhere when analyzing some other package? Or something else?

Copy link
Collaborator

wshanks commented May 5, 2025

I was also puzzled at first because why does a Sphinx configuration affect what goes into the Python package, but looking here I see that debian actually packages the Uncertainties documentation as a separate package. I can also see there that @cjwatson helps maintain it. Personally, I don't mind changing which unicode character is used for the hyphen in the Sphinx copyright to the form that some part of Sphinx already assumes.

Copy link
Member

newville commented May 5, 2025

A packaging system should have a copy of the source kit, and store some hash/checksum key of that kit to prevent alteration of the source code. It can build and install from that source kit any way it wants, including making any patches it wants.

Why should we have to change our code to match the assumptions of some packaging system?

Copy link
Contributor

Sphinx has included support for SOURCE_DATE_EPOCH for many years, which is part of its logic for deciding when to adjust the copyright option to match the current year. As you can see from https://github.com/sphinx-doc/sphinx/blob/master/sphinx/config.py#L769, this logic is sensitive to the punctuation used in the copyright option.

Yeah so this is the problem. It's a bit insane that something depends on the exact punctuation and then repos are being asked to make these punctuation changes.

One part of me says this is a minor change that we can make to make someone's life easier. But another part of me is tempted to reject this so that the right parties (sphinx and reproducible-builds) figure out a better resolution to eliminate the punctuation sensitivity.

Copy link
Contributor

It sounds like right now the punctuation sensitivity in sphinx is retained so as not to break packages that are deliberately working around this in some way. But I guess now other package, who don't really care about this, are being asked to make changes.

Copy link
Contributor Author

cjwatson commented May 5, 2025

@jagerber48 While I don't have figures for everything using Sphinx anywhere, https://tests.reproducible-builds.org/debian/issues/unstable/copyright_year_in_documentation_generated_by_sphinx_issue.html suggests that this is a pretty rare issue at least in the set of packages in Debian. In fact, although I haven't looked very deeply, the other package listed there seems to be unreproducible for some other reason and not this one; so from what I can tell there is one significant package affected by this problem, certainly not thousands or probably even very many.

I noticed this because I'm on the Debian Python packaging team, which deals with the bulk of the Python modules that have been packaged in Debian for one reason or another (in this case probably because it's an indirect dependency of a few applications that somebody wanted to make available in our distribution). There are a lot of these and we generally rely on dashboards to keep things manageable. One of the things on our dashboard is whether packages build reproducibly; the great majority do, but a few such as this one don't.

@wshanks Yes, we often package documentation for various reasons. (Though I wasn't the one who set that up in this case; I've just been keeping the package up to date.)

@newville We can and do apply patches, of course; in fact I already applied this patch to our package, so this PR isn't blocking me. Most upstream maintainers prefer it if we send patches upstream wherever possible, though; and often similar problems affect more than one downstream distributor.

I don't want this to be an adversarial business. It's obviously up to you whether you take this patch; I can't really spend ages trying to persuade the maintainers of one package to take a one-character patch when we have thousands of packages to deal with! But it does seem as though perhaps the copyright option as currently written doesn't actually quite make sense for how this package is currently being maintained, so maybe you want to adjust it in some other way even if you aren't happy with this change.

Copy link
Collaborator

wshanks commented May 6, 2025

this is a minor change that we can make to make someone's life easier

This was my first reaction. I did not actually know en dash existed, only em dash and hyphen. Looking up en dash, I see that one of its main uses is to separate dates, so I am less inclined to switch to hyphen, though I will personally just keep using hyphen rather than en dash in my own typing. It seems like Sphinx should reconsider its behavior and offer an explicit opt out rather than behaving differently for hyphen and en dash.

I could see an argument for hard-coding the date. It is odd to think of someone else re-running the Sphinx code years from now and extending the copyright date. It is a chore to update it manually though.

Copy link

codecov bot commented May 6, 2025

Codecov Report

Image for: Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.54%. Comparing base (fead532) to head (c78a1c5).
Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #312   +/-   ##
=======================================
  Coverage   86.54%   86.54%           
=======================================
  Files          18       18           
  Lines        1903     1903           
=======================================
  Hits         1647     1647           
  Misses        256      256           
Flag Coverage Δ
macos-latest-3.10 95.58% <ø> (ø)
macos-latest-3.11 95.58% <ø> (ø)
macos-latest-3.12 95.58% <ø> (ø)
macos-latest-3.8 95.58% <ø> (ø)
macos-latest-3.9 95.58% <ø> (ø)
no-numpy 78.71% <ø> (ø)
ubuntu-latest-3.10 95.58% <ø> (ø)
ubuntu-latest-3.11 95.58% <ø> (ø)
ubuntu-latest-3.12 95.58% <ø> (ø)
ubuntu-latest-3.8 95.58% <ø> (ø)
ubuntu-latest-3.9 95.58% <ø> (ø)
windows-latest-3.10 95.58% <ø> (ø)
windows-latest-3.11 95.58% <ø> (ø)
windows-latest-3.12 95.58% <ø> (ø)
windows-latest-3.8 95.58% <ø> (ø)
windows-latest-3.9 95.58% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
Copy link

codspeed-hq bot commented May 6, 2025

CodSpeed Performance Report

Image for: CodSpeed Performance Report

Merging #312 will not alter performance

Comparing cjwatson:fix-copyright-year-substitution (c78a1c5) with master (fead532)

Summary

✅ 5 untouched benchmarks

Copy link
Contributor

this is a minor change that we can make to make someone's life easier

This was my first reaction.

That's my thoughts too. This change is practically unnoticable.

I could see an argument for hard-coding the date. It is odd to think of someone else re-running the Sphinx code years from now and extending the copyright date. It is a chore to update it manually though.

Absolutely. I wonder how copyright dates are handled in other larger packages.

I think we should either

  • merge this
  • handle copyright dates in some other way, eg hard-coding
  • one of us puts a patch in to fix this in the other modules (this seems more effort than it's worth)
Copy link
Contributor

I vote +1 to merge this with some minor changes. @cjwatson has taken time clearly spelling out the case for this minor change. I don't feel a strong need to "send a message" with our response to this PR.

I would appreciate a minor message in the changelog pointing out how this is in response to a quirky behavior in sphinx. We can refer to sphinx-doc/sphinx#12738 in the changelog even. I think the discussion in this PR should be referenced also. @cjwatson if you don't want to do that then we can update the changelog ourselves when we make the next release, but the details are fresh in everyone's heads now so it will be easier. If need be, we can clean up the formatting etc. ourselves to conform to our loose conventions at release time.

But, I do think the sphinx behavior is essentially a bug and I'm frustrated that we had to have this conversation here and spend time/energy so that other packages can somehow rely on this buggy behavior as "backwards compatible" feature. So I would appreciate if an issue was opened with sphinx re-visiting sphinx-doc/sphinx#12738. I would be happy to put a comment on such an issue expressing this point.

Copy link
Contributor Author

cjwatson commented May 6, 2025

Maybe the best thing to do first is to loop in @jayaddison on this to see if I've missed anything in my analysis, or if they think it's likely that Sphinx might want to change anything about its handling of copyright years in light of this PR.

Copy link

jayaddison commented May 6, 2025

Hello - Sphinx is doing its best to enable reproducibility for documentation projects with programmatic copyright notices; possible, as here, because conf.py is dynamically-evaluated. As you're all no doubt aware, wrangling dynamically-evaluated notices to stable/reproducible outputs is never going be perfect, so it's a best-effort approach.

We've been aware of this limitation/problem, and from Sphinx v8.1.0 onwards there is a %Y substitution pattern that can be placed into static/constant copyright / project_copyright config strings. This is what I would generally recommend in this case - it should allow you to remove any year-calculation code, while retaining dynamic output. See the documentation at: https://www.sphinx-doc.org/en/master/usage/configuration.html#confval-copyright

(note regarding sphinx-doc/sphinx#12738: I don't have any magic tooling to assess the impact of changes in Sphinx itself on downstream projects (some of which will be closed-source) -- but the reason I pushed back on that was that I'm aware of at least one (open-source) project that intentionally did the opposite - they began using em-dash en-dash precisely to avoid Sphinx's substitution logic. instead of trying to pick a winning side, I wanted to accomodate existing projects and find a likely-mutually-acceptable path forward)

Edit: en-dash, not em-dash; thanks, @wshanks.

Copy link

Mea culpa: I've tested my own recommendation, and using the %Y substitution -- although I do still recommend it -- does not intrinsically use SOURCE_DATE_EPOCH; that remains reliant on the pattern-matching substitution, and therefore ASCII-hyphen.

I'm sorry about that; I'd forgotten how these features related to each other.

Copy link

At the risk of being spammy: the change that I had hoped to suggest, before finding that it does not work, was:

index 28dadeb..5412ee4 100644
--- a/doc/conf.py
+++ b/doc/conf.py
@@ -44,7 +44,7 @@ master_doc = "index"
 
 # General information about the project.
 project = "uncertainties"
-copyright = f"2010–{date.today().year}, Eric O. LEBIGOT (EOL)"
+copyright = f"2010–%Y, Eric O. LEBIGOT (EOL)"
 
 # The version info for the project you're documenting, acts as replacement for
 # |version| and |release|, also used in various other places throughout the

I have opened a feature/change suggestion thread to make such a change possible, and would welcome any thoughts/support/criticism, at: https://github.com/orgs/sphinx-doc/discussions/13526

Copy link
Member

newville commented May 6, 2025

I am still -1 -- willing to be out-voted.

The current code is not broken in a way that we see in our tests or usage. This is a patch for their code and assumptions, not ours. There are no guarantees that we won't "break" this at some point in the future.

We are not obligated to support others building the docs or guarantee that building the docs will yield byte-for-byte compatible results.

As an example: the lmfit-py documentation uses the Sphinx gallery package to generate results that includes the run time to milliseconds and auto-generated images (see https://lmfit.github.io/lmfit-py/examples/example_fit_with_inequality.html#sphx-glr-examples-example-fit-with-inequality-py for example). It is not byte-level reproducible. The assumption that rendered docs must be byte-level reproducible is flawed.

Again, willing to be out-voted.

Copy link
Contributor

@jayaddison

but the reason I pushed back on that was that I'm aware of at least one (open-source) project that intentionally did the opposite - they began using em-dash precisely to avoid Sphinx's substitution logic. instead of trying to pick a winning side, I wanted to accomodate existing projects and find a likely-mutually-acceptable path forward)

But this seems odd. Having code reliant on whether a certain character is an en-dash or a hyphen is very fragile and it seems a bit crazy to have or support any code that is reliant in that way. I don't fully understand the SOURCE_DATE_EPOCH feature and how it is utilized, but couldn't there be an explicit configuration option in conf.py to either opt in or out of usage of that feature? If such a feature existed I wouldn't hesitate to change our conf.py to opt in as an override of the en-dash opt out if that is maintained for some temporary period of time. But changing this unicode character so that other people code works feels frustrating.

@newville

As an example: the lmfit-py documentation uses the Sphinx gallery package to generate results that includes the run time to milliseconds and auto-generated images (see https://lmfit.github.io/lmfit-py/examples/example_fit_with_inequality.html#sphx-glr-examples-example-fit-with-inequality-py for example). It is not byte-level reproducible. The assumption that rendered docs must be byte-level reproducible is flawed.

This is a good point. I fully agree with this. Especially

The current code is not broken in a way that we see in our tests or usage. This is a patch for their code and assumptions, not ours. There are no guarantees that we won't "break" this at some point in the future.

Maybe I haven't yet been burned enough by other people's demands on the codebase I maintain, but for this one I'm willing to just pass it through to make them happy. Obviously if another reproducible build issue comes up I would be less willing to comply for tenuous justification...

Copy link
Collaborator

wshanks commented May 6, 2025

The Wikipedia entry for en dash lists connecting ranges as the first use case for en dash, so the current usage is valid. On the other hand, even the Wikipedia entry notes that some style guides favor hyphen over en dash:

Preference for hyphen vs. en dash in ranges varies. For example, the APA style (named after the American Psychological Association) uses an en dash in ranges, but the AMA style (named after the American Medical Association) uses a hyphen

So changing it is not a complete violation of punctuation convention.

I agree though that requiring this kind of minor change does not seem like the right way to achieve something like reproducible builds because it does not scale well , but I know reproducible builds is a long-running effort and probably anything I could come up with on the fly has already been considered.

It seems like we have a -1 vote and a +1 vote and then two ambivalent votes leaning towards +1, so we will probably merge this. If it seemed like we would vote it down, I would suggest we make the change to %Y and perhaps @jayaddison's suggestion will make that work with reproducible builds in the future. Perhaps we should still change to %Y as well any way (it just requires setting the minimum Sphinx version to 8.1; the readthedocs build is using 8.2 currently).

Copy link
Contributor

jagerber48 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. Let's just clear this out. Will merge in the next couple days unless someone else does.

jagerber48 merged commit 3df41fd into lmfit:master May 17, 2025
22 checks passed
cjwatson deleted the fix-copyright-year-substitution branch June 7, 2025 23:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
6 participants