Plotting Maritime Vessel Positions

Heatmap of log-scaled ship positions spanning over a decade

One frame of the ship movement animation

Maritime Vessel Positions

The Australian Maritime Safety Authority Digital Data website has maritime vessel positions dating back over two decades. I love working with data that has a spatial component, as it's fun to visualise and easy to understand.

In this post I'll share some tooling I created to make this data easier to download and work with.

The main pain points I found working with this data are that it's difficult to pull down in bulk (the webpage requires you to accept the terms for each file you download) and that the files you download are double-nested zip files containing a shapefile. When I first worked with this data a few years ago I downloaded many years of it one month's worth at a time and hand-extracted the zip files, which was onerous. This time I've prepared two scripts, saved in my scrape_amsa_vessel_positions repo: one to download all the zip files, and a second to open the shapefile in each double-nested zip file and re-save it as a GeoParquet file. Both scripts by default only download and extract new data, so you don't need to worry about re-running them.
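
If you want a feel for what the conversion script does, here's a minimal sketch (the function name is mine, and it assumes each outer zip holds exactly one inner zip with a single shapefile inside):

import tempfile
import zipfile
from pathlib import Path

import geopandas as gpd

def extract_to_geoparquet(outer_zip_path, output_dir):
    # Each AMSA download is a zip containing another zip that holds the shapefile
    output_path = Path(output_dir) / (Path(outer_zip_path).stem + ".parquet")
    if output_path.exists():
        return  # already converted - skipping makes the script safe to re-run
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(outer_zip_path) as outer:
            outer.extractall(tmp)
        inner_zip = next(Path(tmp).rglob("*.zip"))  # find the nested zip
        with zipfile.ZipFile(inner_zip) as inner:
            inner.extractall(tmp)
        shapefile = next(Path(tmp).rglob("*.shp"))
        gpd.read_file(shapefile).to_parquet(output_path)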

Once you have a full directory of GeoParquet files it's very simple to open all the parquet files and concatenate them into a single GeoPandas dataframe if you like (beware that this will consume tens of gigabytes of RAM if you try to combine the full history though).
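
Something like this works, assuming the converted files live in a vessel_positions directory (the directory name is just an example):

from pathlib import Path

import geopandas as gpd
import pandas as pd

# Read every GeoParquet file and stack them into one GeoDataFrame
parquet_files = sorted(Path("vessel_positions").glob("*.parquet"))
positions = gpd.GeoDataFrame(
    pd.concat((gpd.read_parquet(f) for f in parquet_files), ignore_index=True)
)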

There are a few ways to visualise this data once you've collected it. I like leveraging Matplotlib's FFmpeg integration to create mp4 movies over time. Another easy way to visualise this data, which looks really pretty as a static image, is to create a heatmap with the Datashader package.
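
As a rough sketch of the Datashader approach (this assumes the positions GeoDataFrame from above with point geometries; the canvas size and colours are arbitrary):

import datashader as ds
import datashader.transfer_functions as tf
from datashader.utils import export_image

# Datashader wants plain x/y columns rather than a geometry column
df = positions.assign(x=positions.geometry.x, y=positions.geometry.y)

canvas = ds.Canvas(plot_width=1600, plot_height=900)
agg = canvas.points(df, "x", "y")  # count of positions per pixel
img = tf.shade(agg, cmap=["black", "white"], how="log")  # log-scaled shading
export_image(img, "vessel_heatmap")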

I'll showcase some of my visualisations of this data in the visualise_amsa_vessel_positions repo.

Conda is awesome

I'm a huge fan of Conda. If you work with Python packages that wrap or bind to applications or libraries written in other languages, Conda is a fantastic way to install them: it installs not only the Python package but also the underlying application or library and the configuration needed to tie them together. For example, I once installed TensorFlow GPU the Google-recommended way by installing the appropriate Nvidia drivers, then the middleware libraries, then finally pip installing TensorFlow while tweaking environment variables and paths to integrate it all together - that's not a fun way to burn away an hour or two of your life. With Conda I just type "conda install tensorflow-gpu", go make a coffee, and all that hassle is dealt with. Some packages like Nvidia RAPIDS can only be installed with Conda - unless you're brave enough to build them from source.

Conda can save you from dependency management hell and give you back many hours of your life. You can even use it as a cross-platform app installer, for example to install and update the popular QGIS application. Who needs Chocolatey, Homebrew, or apt-get when you have Conda?

But Conda can still burn you

Despite Conda being more idiot-proof and much faster than most methods for installing packages with complex dependencies, there are unfortunately still quite a few ways to trip up when using it. I wanted to cover some gotchas and hopefully spare others some of the pain I've experienced with Conda environments.

Even with best practices Conda environments will still break (but much less often)

Even if you follow all my recommendations, your Conda environments will break sooner or later if you keep upgrading them. It's unfortunate, but package authors are human and dependency management is hard. Occasionally a package will be released with a bug that evaded its test suite (test coverage can be limited, and the quality of the tests might vary across platforms). Semantic versioning and pinned dependencies aren't perfect either - the author of a foundational package can inadvertently make a breaking change to their API that ripples through dependent packages in your environment. The more packages you install into one environment and the more often you update them, the higher the odds that one or more packages will end up in a bad state, and the scripts and code that used to work in that environment will start to error out.

It's not all bad news though: there are several strategies for restoring a broken environment, which I'll cover next, and if you're patient the package bugs or dependency issues will sometimes be fixed for you by the next update.

Record your environment configuration

My first practice when creating a new Conda environment is to document the packages I install. You can extract the package list after the fact using Conda's export option, which saves a YAML file you can re-use to create the environment. I'm assuming here you're using the terminal (the Anaconda console on Windows) and not Anaconda Navigator or the PyCharm IDE to create your environment. The full Anaconda install is great if you're setting up an environment for a new Python user and want a good mix of packages before introducing virtual environments, but if you're already at the stage of installing additional packages and creating new virtual environments for them, I strongly recommend the Miniconda install and creating your environments from the terminal.

For documenting my environments I prefer a simple technique: record the command line you used to create the environment in a document (I use a Google Doc) so that, worst case, you can reinstall it or recreate it on a new computer. If you install additional packages into the environment later, remember to go back and update this document.
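
For completeness, the export approach looks like this (myenv is a placeholder environment name):

conda env export -n myenv > environment.yml
conda env create -f environment.yml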

Part of the reason I use a document rather than just exporting the Conda environment is that I use JupyterLab in a lot of my environments, so I want to record the additional commands for installing and enabling JupyterLab extensions should I need to recreate the environment (some Conda packages install their corresponding JupyterLab extensions... but not all do, and there's some debate in the JupyterLab community on whether that's good practice, given Jupyter could be accessed from a different machine to the one with the Python package). This environment document becomes my ultimate fallback if a Conda environment breaks and I need to delete and reinstall the whole thing. But before resorting to reinstalling after a bad update, there are some faster strategies to try first that will probably get things working again.
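
As an illustration, an entry in my document might look like this (the environment name, packages, and extension are just an example):

conda create -n geo -c conda-forge python=3.9 geopandas datashader jupyterlab
jupyter labextension install @pyviz/jupyterlab_pyviz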

Restoring a broken Conda environment

Conda maintains a list of revisions to your environment, updated each time you install new packages or update existing ones, and at any point you can revert to an earlier revision if things go wrong. I didn't know about this for months when I first started using Conda, and I wasted a lot of time recreating environments when a simple restore would have been much quicker and easier. Check out the environment revision list and install revision commands. They're quite straightforward to use and work a bit like a git revert: restoring creates a new revision that attempts to match the configuration of the older target revision, rather than discarding all the revisions made since. This won't perfectly restore all your environment state, and there's a chance it won't fix a bad update, but it's probably the first thing to try when you notice you've got one.
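
For example (the revision number here is illustrative):

conda list --revisions
conda install --revision 2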

Pinning and installing a specific package version

Another option, if you've been unlucky enough to update to a broken version of a particular package, is to revert that specific package. You can use the same conda list --revisions command to see what version of the package was last working, then explicitly install the good version to force a downgrade. For example, if you were an early adopter of Pandas 1.0 but ran into some of its bugs you could run:

conda install pandas=0.25.3

to revert to the previous Pandas version.
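
If you also want to stop Conda from upgrading that package again on the next update, you can pin it: Conda reads a file named pinned in the environment's conda-meta directory and holds any package listed there at the specified version. For example (the environment path is a placeholder):

echo "pandas ==0.25.3" >> ~/miniconda3/envs/myenv/conda-meta/pinned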

Tips to make Conda environments break less often

I've covered options for handling broken Conda environments first because it's a question of when, not if, it will happen. There are other steps you can take, though, to reduce the chances of an environment getting into a broken state in the first place, which I'll cover next.

Keep your environments simple

As I mentioned earlier, the more packages you install in one environment and the more often you update them, the odds of that environment imploding get uncomfortably high. I have (more than once) gotten over-enthusiastic, installing every cool-sounding machine learning, data processing, and data visualisation package into one Über environment, and I always end up regretting it. Beyond packages simply breaking, it can really slow down dependency resolution when updating or installing new packages. You're also more likely to get stuck with older versions of packages, as bottlenecks form in the complicated web of package dependencies if you let an environment become too bloated.

The Anaconda base environment is probably a good example of a dangerously large number of packages in a single environment. It works there because the Anaconda team has painstakingly tested all of those package versions together for compatibility - if you were to manually add all those packages to a new environment, or were brave enough to start updating packages in (or installing new packages to) your base Anaconda environment, you'd run into trouble pretty quickly.

Keep your environments simple. As a general rule, don't update the base Anaconda environment (with the exception of the conda package itself, which is recommended), and if you don't need to update an environment, leave it be (I am a compulsive upgrader though, which is why I've run into this pain point multiple times).
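
In practice this means creating a small, task-focused environment per project rather than one mega environment - for example (the name and packages are just an illustration):

conda create -n vessel-viz -c conda-forge python=3.9 geopandas datashader matplotlib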

Enable channel priority strict

I didn't know about this option until I read the conda-forge instructions more carefully, but enabling strict channel priority is a good idea. You can do it by running this command:

conda config --set channel_priority strict

or by manually pasting the line

channel_priority: strict

into your .condarc file.

Strict channel priority can save a lot of pain: the compiled libraries in different Conda channels aren't always binary compatible, and if you start mixing packages from different channels (conda-forge and defaults channel packages, for example) you can experience some nasty runtime errors.

As I understand it, strict channel priority means packages from higher priority channels (the channels listed first in your .condarc file, unless manually specified) are always preferred, even if they're at a lower version than the same package in a lower priority channel.
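
For example, a .condarc like this prefers conda-forge over the defaults channel and enforces strict priority:

channel_priority: strict
channels:
  - conda-forge
  - defaults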

conda-forge is my favourite channel, so I prioritise it ahead of the defaults channel for most Conda environments - but when I update my Miniconda base environment I explicitly prioritise the defaults channel on the command line, e.g.

conda update -n base -c defaults -c conda-forge conda conda-build conda-verify ripgrep mamba python

My general rule is to prefer a single channel source for most packages in an environment (and to always use the same channel priorities when updating an environment as you did when creating it). A package that is pure Python should be safe to install regardless of which channel it comes from, but for packages containing compiled libraries it is riskier to mix channels within the same environment.

Other things to be wary of with Conda

That covers the most important points, but there are a few other less common issues I've encountered which I'll quickly go over in case anyone runs into them.

Mixing Pip and Conda packages

If you encounter a package on PyPI that doesn't exist on any Conda channel, it is possible to pip install it into a Conda environment. I generally try to avoid this, but if you do need to, I believe the recommended approach is to create your Conda environment and install as many packages as possible from Conda first, then pip install the packages you can't get from Conda - pip should recognise the dependencies already installed by Conda. I also believe you shouldn't go back and install or update Conda packages after pip installing packages; instead, remove the environment and follow the same steps to create it again if you want to update the Conda packages in an environment that has pip packages installed (I don't have much experience with this and would welcome feedback from other coders who have done it). I'd also mention that if a PyPI package is pure Python it isn't very difficult to convert it to a Conda package (conda-forge has good tools, tutorials, and examples for doing this), so if you find a popular PyPI package that isn't on a Conda channel, please consider converting it yourself.
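
That ordering looks something like this (the names here are placeholders):

conda create -n myenv -c conda-forge python=3.9 geopandas
conda activate myenv
pip install some-pypi-only-package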

Write permissions on the working directory on Windows

One issue I've encountered on corporate Windows machines (Anaconda/Miniconda can install to the user's local directory and doesn't need local system admin rights in that case) is that some packages download temporary files to the working directory when installing or updating. I had an issue on a work computer where the Anaconda console would open in the root of the C drive (with no write permission), and when attempting to create or update Conda environments, Conda would just lock up - it didn't error out or log a message that I noticed (I had to use Windows developer tools to diagnose the problem), it would just stay frozen on installing the package... My fix was to edit the Anaconda console shortcut to default to the user's local directory instead of the root of the C drive. This issue occurred a while ago and may be fixed now.

DLL hell on Windows

Another issue I found on Windows was DLL name clashes with NumPy, which happened every time I tried to import NumPy from a script run from the Anaconda console. The fix I found in that case was to set the CONDA_DLL_SEARCH_MODIFICATION_ENABLE environment variable to 1 - you can Google that environment variable and read about related fixes in the Conda troubleshooting guide.
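
For the current Anaconda console session you can set it like this, or bake it into the active environment with conda env config vars (available in newer Conda versions):

set CONDA_DLL_SEARCH_MODIFICATION_ENABLE=1
conda env config vars set CONDA_DLL_SEARCH_MODIFICATION_ENABLE=1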

Adding a Conda environment directly to the Path on Windows (don't do this!)

Before I had a good answer to the DLL problem above, a quick and dirty workaround was to manually add the Conda environment's binary and script directories to my Windows Path (Anaconda/Miniconda will even offer to add the base environment to the Windows Path when installing, but recommends against it). This is a very tempting hack for new Conda users since it avoids the DLL and path resolution issues that are a nasty pain point when starting out (I've even seen certain cloud VM configurations use this hack) - but long term it's bad because it defeats the whole purpose of having multiple environments. If you activate a different Conda environment but leave the Path variable pointing at the libraries of the first environment, you're asking for some potentially nasty bugs when the wrong library files get used. The only time you might get away with this is if you install Anaconda, only use the base environment, and don't use any other Python environments on the same system.

I have still run into some DLL resolution issues, even with the CONDA_DLL_SEARCH_MODIFICATION_ENABLE fix, when debugging with Visual Studio Code. The trick there was to set a PYTHONPATH environment variable in the launch.json (PyCharm sets PYTHONPATH to sensible defaults out of the box, but Visual Studio Code takes a bit more effort). As with regular Paths, don't permanently add a PYTHONPATH environment variable to your system environment variables (that's no better than adding an environment's binary paths to the system Path variable) - make sure PYTHONPATH is configured by your IDE or within your terminal session, not permanently as a system environment variable.
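
As a sketch, a launch.json with PYTHONPATH set might look like this (the value you need depends on your project layout and environment):

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Current File",
            "type": "python",
            "request": "launch",
            "program": "${file}",
            "console": "integratedTerminal",
            "env": {
                "PYTHONPATH": "${workspaceFolder}"
            }
        }
    ]
}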

Other issues

If a Conda install or update fails for other reasons, it will normally give you a link to a log file where you can find the specific details. I've had issues where some Conda packages depended on git being installed, and I only noticed that from the log file. You can always fall back on the age-old technique of Googling the exact error message as it appears in the terminal or log file, and also check the troubleshooting guide mentioned above if you encounter anything else.

Conclusion

I hope this advice helps. I refer back to the Conda documentation often. If you find what appear to be bugs in Conda itself, please do raise them on the Conda GitHub page. If you like Conda you can follow Anaconda on Twitter, or follow Peter Wang - a co-founder of Anaconda - as he occasionally responds directly to user questions.

If you spot any mistakes in my guide, or have experiences with Conda environments you think are worth sharing, feel free to message me via one of the links below. I'm always keen to learn more and share what I learn as I go.