A Blog Writer’s Workflow 🔗

So while I haven’t exactly posted a great deal on this blog, I have, as I suspected, been very actively considering posting to it. To that end my brain has been endlessly throwing ideas at me for improving how this site works in the backend, because procrastinating on writing by programming is almost fine! So in this post I aim to detail a bit how this workflow works, from the humble keystrokes you are reading to the actual deployed HTML you are reading them on. Just like my other posts, I want this to be accessible even if you aren’t a computer wizard (but some technical knowledge will be assumed).

An overview of the process 🔗

As the little footer on this page may lead you to believe, this blog is built with Hugo. Hugo is a static site generator written in Go that gobbles up your structured textfiles and spits out rendered HTML. This means that all of my posts start out as Markdown (md) files in a directory structure that mimics that of the blog. At the time of writing the directory with the blog content looks a little something like this:

content/
├── about.md
└── texts
    ├── blogwriters-workflow
    │   ├── grimagoire-example-input.png
    │   ├── grimagoire-example-output.png
    │   ├── grimagoire-example-output.svg
    │   └── index.md
    ├── bluesky-phishing
    │   ├── 6v-page.png
    │   ├── index.md
    │   ├── landing-page.png
    │   ├── redirect-page.png
    │   └── scammer-intro.jpeg
    ├── _index.md
    ├── letter-to-myself
    │   └── index.md
    ├── writeup-kattastrofen
    │   └── index.md
    └── writeup-undutmaning-raring
        ├── index.md
        ├── wireshark-1.png
        └── wireshark-2.png

If I were more tryhard, I would offload all the non-text files to another content management system (CMS), but as this blog isn’t quite at that scale, it is simply not needed. Hugo can use these files together with a theme to generate the content of the page in crisp HTML format which can be viewed by people like yourself! Good thing tools like this exist because there is not a single frontend-writing bone in my body, hence the minimalist aesthetic of this blog.

Hugo can also be a webserver, and I served this site like that for ages, but I have moved back to using Caddy to serve what Hugo generates. This way I have a somewhat more direct grasp of what exactly is served, which we will return to later. In order to get from a bunch of md files on my local machine to a blog in the 🌈cloud🌈 a few more steps need to happen. Firstly, we need to get this “source” into the cloud as well, so we can turn it into HTML in the right place. Secondly, we need to actually do that compute and also manage to get the results into the right place. Lastly, we need to display this hard work to the ever-distant public. For this we will “need” some tools, some source control, and some continuous integration!

Git, the information manager from hell 🔗

Since the contents of most programming projects are textfiles, programmers have gotten pretty good at managing them. While there are solutions today for collaborating live on text documents with others, the experience is nothing like using a source control system. Git is not the first such system to be developed, but it has since its original release in 2005 become the default source control tool nearly everywhere. Seeing as this project too is mainly a bunch of textfiles, it is a very useful tool for managing them. Repositories are the core of git, and can be thought of as a directory with a global undo/redo feature that anyone collaborating on a project can use. It tracks all the history of changes to every file, it tracks all the current branching changes to the directory, and it helps consolidate changes from various points in time into a cohesive whole.

Hugo very much prefers that you use git to manage the content files it generates the static site from, so it really isn’t that optional. Themes for Hugo, like the one this site’s theme is based off of, are managed as git submodules: a kind of repository within a repository. This way you can both track your own content files in your git repository AND get all the new stuff for your theme from its upstream, without having them interfere with each other. Pretty neat, actually.

The toolchain 🔗

Now on to the most exciting bit, my custom cute little toolchain! I am the type that would rather implement something myself than learn the syntax of someone else’s perfectly functional tools. So it shouldn’t come as a surprise when I inevitably end up reinventing the wheel every now and again. But some of my creations in this toolchain are a bit more novel, and proportionately less useful. The same goes for my names, as I never really care to double-check them once they pop into my head. All of these tools have their own code repositories, but are bundled as submodules into a unified “tools” repository. This meta-repository is then used as a submodule in the repository that houses this blog, letting me keep it updated when I work with my tools without having to bother moving and installing them everywhere. Of course there is a benefit to properly packaging and distributing your tools as well, but as before, the scale of this project really doesn’t warrant that degree of complexity or effort.

Image manipulation with grimagoire 🔗

Grimagoire is a python script I wrote, with components that are years and years old, to do some basic image manipulation for the site. Currently, it has support for EXIF data scrubbing as well as ASCII-ification of images. EXIF data is a small data segment in image files that holds some metadata about the image, which is super handy in a lot of situations. However, the metadata can also prove a privacy concern when an image is shared. It can reveal data very directly, like the geographical location where a picture was taken, or less direct data, like when it was taken. Most social media sites today will read such metadata off the things you post there to gather some more information about you, the poster, and then strip it before allowing the public to access it. When you self-host your images, however, there is no service to do that data stripping, so you have to do it yourself! Enter Exifnt:

import PIL.Image
import PIL.ExifTags

class Exifnt():
    def __init__(self, img):
        self.img = img

    def get_exif(self):
        # Decode the EXIF block into human-readable tag names.
        if not self.has_exif():
            return None
        exif = {
            PIL.ExifTags.TAGS[k]: v
            for k, v in self.img.getexif().items()
            if k in PIL.ExifTags.TAGS
        }
        return exif

    def has_exif(self) -> bool:
        return bool(self.img.info.get("exif"))

    def rm_exif(self):
        # Copy only the pixel data into a fresh image, leaving the metadata behind.
        tmp = PIL.Image.new(self.img.mode, self.img.size)
        tmp.putdata(self.img.getdata())
        self.img = tmp
        return self.img

Exifnt uses Pillow, likely the most widely used python imaging library, to work with images. It can simply check whether an image has any EXIF data, which is useful for running automated tests, but it can also strip it. It does this quite simply by creating a new Pillow image object and moving only the image data from the original image into it. This skips copying the metadata, which is subsequently thrown to the cosmic winds. This feature is called from a git pre-commit hook, i.e. it runs every time I try to add something to the repository, and scrubs out EXIF data from images throughout the repo.
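
For illustration, scrubbing a single image with the class above might look a little something like this (the file names are just placeholders; the real invocation happens inside grimagoire):

import PIL.Image

# Hypothetical one-off usage of Exifnt on a single file.
img = PIL.Image.open("holiday-photo.jpg")
cleaner = Exifnt(img)
if cleaner.has_exif():
    print("Found EXIF data:", cleaner.get_exif())
    cleaner.rm_exif().save("holiday-photo-clean.jpg")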

Grimagoire can also do some cute little image conversions, like turning an image into ASCII character art. So you can go from an image like this:

Example input

To a rasterized image like this:

Example rasterized output

Or a vector graphics output like this:

Example vector graphics output

Which is just there for a fun effect, and if I ever upload images to the blog that are not screenshots of #HackerSoftware, I will likely be using this to stylize them.
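
For the curious, the core of that ASCII-ification idea fits in a few lines of Pillow. This is a toy sketch rather than grimagoire’s actual implementation:

import PIL.Image

# Characters ordered roughly from darkest to brightest.
CHARS = " .:-=+*#%@"

def to_ascii(path, width=80):
    img = PIL.Image.open(path).convert("L")        # greyscale
    ratio = img.height / img.width * 0.5           # characters are taller than they are wide
    img = img.resize((width, max(1, int(width * ratio))))
    pixels = list(img.getdata())
    lines = []
    for y in range(img.height):
        row = pixels[y * img.width:(y + 1) * img.width]
        lines.append("".join(CHARS[p * (len(CHARS) - 1) // 255] for p in row))
    return "\n".join(lines)

print(to_ascii("grimagoire-example-input.png"))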

Language controls with texchecks 🔗

Texchecks is an interesting little project that grew into a terrible spaghettified mess very quickly. I hadn’t worked with the OpenAI API in ages, but I knew that I wanted SOMETHING to check for errors in my texts beyond a simple spellcheck. Naturally I could get something like Grammarly into my IDE, but that seemed dull, so I wrote my own solution instead. Initially, I wanted it to work like pytest, but that ambition turned from function to form very quickly as I realized how incredibly unreliable ChatGPT is. Sure, it managed to consistently find things like spelling and grammar errors in the text… but it also made some up just out of nowhere. Below is a very abridged and horizontally trimmed sample of the output texchecks gives when reading this very text at the time that I am writing these words.

======== 1 spelling mistakes found in the text ========
============== | Line: 38 | char: 15-22 | =============
If I were more tryhard, I would offload all the non-tex 
               ~~~~~~~
Change to: try-hard
Reason: Hyphenation is required for compound adjectives.

========= 2 grammar mistakes found in the text ========
============== | Line: 45 | char: 23-39 | =============
Hugo very much prefers that you use git to manage the c
                       ~~~~~~~~~~~~~~~~
Change to: that you use Git
Reason: 'Git' should be capitalized as it is a proper n

============= | Line: 43 | char: 763-780 | ============
all the current branching changes to the directory, and
                ~~~~~~~~~~~~~~~~~
Change to: branches of changes
Reason: Preferable phrasing for clarity.

======= 1 formatting mistakes found in the text =======
============== | Line: 91 | char: 0-36 | ==============
### The workflow within the workflow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Change to: ### The Workflow Within the Workflow
Reason: Consistent capitalization in headings.

======= 2 word choice mistakes found in the text ======
============= | Line: 15 | char: 140-150 | ============
tructured textfiles and spits out rendered HTML. This m
                        ~~~~~~~~~~
Change to: processes
Reason: 'Processes' is a more formal and precise term.

============= | Line: 40 | char: 500-524 | ============
Secondly, we need to actually do that compute and also 
                     ~~~~~~~~~~~~~~~~~~~~~~~~
Change to: actually perform that computation
Reason: More precise terminology for technical context.

======= 1 general improvements found in the text ======
=============== Use of emoji is informal ==============
Suggestion: Replace emoji with text if formality is nee

=======================================================

It’s not great, I find. Some of the things it suggests would turn the text into the same mechanical slop that it itself generates. Big surprise, isn’t it? This makes the original idea of using this as a pre-commit check dubious at best, and it has since been relegated to the infernal position of being run as needed. It never helped its case that it costs (literal nickels) to run as well, which adds a scary, scary cost to writing. But as LLMs are literally big text statistics boxes, it is pretty good at picking out where my texts have issues that my text-blindness renders me incapable of noticing by myself. Future development plans include some other text checks, like an actual spelling library perhaps, or something that capitalizes standalone “I”s, which my non-native self always fails to do. I am looking into Gemini as a replacement for ChatGPT as well; it just might not make stuff up about my texts.
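
If you are curious what wiring a check like this up looks like, the general shape is roughly the following. This is a minimal sketch and not texchecks’ actual code; it assumes the current openai python client, an API key in the environment, and a made-up prompt:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def check_text(text):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "You are a strict copy editor. List spelling, grammar and "
                        "word choice mistakes with line numbers. Do not invent mistakes."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

with open("content/texts/blogwriters-workflow/index.md", encoding="utf-8") as f:
    print(check_text(f.read()))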

Formalities with fixaform 🔗

More often than not I manage to write a post within a few hours, since I don’t start writing unless I WANT to write. But sometimes it takes me a while to get around to finishing a post, which reflects poorly in things like that little timestamp at the top of the page. Never fear, fixaform is here! Fixaform checks all the little things in the Hugo front matter, which is a YAML data structure, and even fixes them where able. So if my date is more than a day old at the time of publishing, it simply sets it to the current time, which more than likely reflects when I got those final revisions done. This script, like exifnt, is called by the pre-commit hook so it runs every time I try and submit something to the git gods. It corrects the small inconsistencies that bug me otherwise and makes that barrier to posting just a bit lower.
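
The date fix boils down to something like the following. This is a simplified sketch rather than fixaform’s actual code, and it assumes the front matter is plain YAML sitting between the first two “---” lines of the post:

import datetime
import yaml

def bump_stale_date(path):
    # Split out the YAML front matter from the rest of the post.
    with open(path, encoding="utf-8") as f:
        _, header, body = f.read().split("---", 2)
    meta = yaml.safe_load(header) or {}

    now = datetime.datetime.now().astimezone()
    date = meta.get("date")
    # Only touch dates that parsed as datetimes and are more than a day old.
    if isinstance(date, datetime.datetime) and now - date.astimezone() > datetime.timedelta(days=1):
        meta["date"] = now.isoformat(timespec="seconds")
        with open(path, "w", encoding="utf-8") as f:
            f.write("---\n" + yaml.safe_dump(meta, sort_keys=False) + "---" + body)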

Signatures with siggy 🔗

The final component of this little toolchain, for now, is siggy, the signer. Siggy is a glorified shell script that I typed up one evening in python, which recurses through a directory structure and signs all the files of a certain extension with GnuPG. PGP signing is basically THE way to prove that something was authored by someone, which perhaps isn’t critical for a blog, so I’ve been lax on a few of the security features one should observe. The idea is that I can, with my super secret private key, generate a cryptographic proof based on a file. This proof, also known as a signature, can then be validated using a public key which anyone can access. So, with a file (or any lump of data), a signature of that file made with a private key, and the public key corresponding to that private key, you can mathematically prove that that private key signed that file. Now, if you can also prove that that private key belongs to a specific entity and only they have access to it, then you can prove that that entity created a certain file.
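
The signing half of that is essentially the following. This is a rough sketch of the idea rather than siggy’s real code, and the key ID is a placeholder:

import pathlib
import subprocess

def sign_tree(root, extensions=(".html", ".png"), key="KEY-ID-GOES-HERE"):
    # Walk the tree and write a detached signature next to every matching file.
    for path in pathlib.Path(root).rglob("*"):
        if path.suffix in extensions:
            subprocess.run(
                ["gpg", "--batch", "--yes", "--local-user", key,
                 "--detach-sign", "--output", f"{path}.sig", str(path)],
                check=True,
            )

sign_tree("./public")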

Siggy’s role, then, is to use my private key to sign everything important on this blog, which is texts and images (though it can of course be extended to arbitrary file formats). You can go right now and grab the signature for this page from the footer; there you will also find the link to my public key, with which you can verify that I wrote this text. In order to do so you would first download the public key, which you can get with curl:

https://alisceon.com/gpg-key.pub

and the signature which can be found at:

https://alisceon.com/texts/blogwriters-workflow/index.html.sig

The last part you need is this page, which is… here!

https://alisceon.com/texts/blogwriters-workflow/index.html

Then you need to check that index.html.sig actually signs index.html with gpg-key.pub. So first import the key with:

gpg --import gpg-key.pub

Then verify the signature and data with:

gpg --verify index.html.sig index.html

As a simple script, it would look something like this:

#!/bin/bash
mkdir .tmp
cd .tmp
curl https://alisceon.com/gpg-key.pub -O
curl https://alisceon.com/texts/blogwriters-workflow/index.html.sig -O
curl https://alisceon.com/texts/blogwriters-workflow/index.html -O
gpg --import gpg-key.pub
gpg --verify index.html.sig index.html
cd ..
rm -rf .tmp

And naturally you can change the URLs for whichever post you want to check the signature of. To make the process a bit easier, I added the public key and signature to the footer of every page. Siggy can also verify the signatures of an “entire” domain (it only looks where the host asks it to, out of kindness) that follows the same format, which is how I can test and make sure the deployment went well.

There is an issue with this approach that needs to be remarked on: “how in the seven hells would you verify the key belonged to me?”. If someone steals control of the website they can just swap out the public key and re-sign everything! You only have one source of truth in this case, which is this site. This is not enough to actually trust a PGP key, and by extension its signatures. You need more sources of truth, and the more you have, the higher the degree of trust you can have. This is why people will sign each other’s keys, proving that they trust that key’s legitimacy. This creates a web of trust over time, where a key is as legitimate as whoever has signed it with their key, which is as legitimate as whoever has signed it, and so on. My key is signed by… no one! At least no one whose signature bears any weight, and it shouldn’t be, as I have provided no proof of my control of it. This is more a deployment for fun and demonstrative purposes than an actual security feature for the blog, which is why I mentioned earlier that this is in no way critical. If you distribute any kind of executable, pretty please provide a signature with more robustness than this. In theory this adds a layer of protection against Man-In-The-Middle (MITM) attacks as well, but that is already damn near impossible given that this site will only allow HTTPS connections.

The workflow within the workflow 🔗

Now that we have looked at the stack of tools I use to make my work easier, we will have a quick look at how they actually run. Some of their functionality, as mentioned before, runs before a commit is made. This is handled through a git hook that is part of the tools repository, very simple. The next stage happens after things are pushed up to the remote repository, however. Here we trigger what the GitHub-o-sphere calls a workflow, or what the GitLab-o-sphere would call a CI/CD pipeline. It’s all just fancy words for a script that runs within the context of a repository. What I have done with this repo is highly standard: I have a protected main branch (which means it cannot simply be pushed to), and feature branches that need to be merged into it. The branch that this post will be pushed on is called blogwriters-workflow, as this post is the main content of it, which makes it easier to see which branch relates to which feature.
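
To give a feel for what that hook stage does, here is a hypothetical stand-in; the real pre-commit hook lives in the tools repository and its exact invocations differ:

#!/usr/bin/env python3
# Hypothetical .git/hooks/pre-commit: run the toolchain against staged files
# and abort the commit if anything is unhappy.
import subprocess
import sys

staged = subprocess.run(
    ["git", "diff", "--cached", "--name-only"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

images = [f for f in staged if f.lower().endswith((".png", ".jpg", ".jpeg"))]
posts = [f for f in staged if f.endswith(".md")]

try:
    if images:
        # Scrub EXIF data before it ever enters the repository.
        subprocess.run([sys.executable, "tools/grimagoire/grimagoire.py", *images], check=True)
    if posts:
        # Tidy up the front matter of any touched posts.
        subprocess.run([sys.executable, "tools/fixaform/fixaform.py", *posts], check=True)
except subprocess.CalledProcessError:
    sys.exit(1)  # a non-zero exit blocks the commit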

When a pull request, i.e. a request to merge a feature branch into main, is opened, my Gitea repository triggers a workflow. This workflow runs a few tests to make sure that the contents of the feature branch are production-ready. This is super simple to wrap your head around when you have traditional code to work with, as testing suites make it very clear-cut what is and isn’t ready to go “live”. But as previously mentioned, we mostly work with text files, which are not quite as clear-cut. This is where the tools return! Alongside their ability to make little corrections and suggest changes, they can also test my content. Grimagoire tests my images to make sure EXIF data is removed and fixaform looks for metadata inconsistencies. They will light up a big red X if they aren’t completely happy with the contents of the blog, much like pytest would with misbehaving python code. When they are happy, the workflow tries to build the site with Hugo just to see that it works. Lastly, siggy tries to sign all the files it needs to in the output directory, to make sure everything there works as well. This part could be called “Continuous integration”, if you stretch the definition a bit.

name: Test hugo site
on:
  push:
    branches:
      - '!main'
  pull_request:
    branches:
      - main
  workflow_dispatch:
defaults:
  run:
    shell: bash

jobs:
  test:
    runs-on: ubuntu-latest
    container:
      #I build my own hugo image because I was having issues with all 
      #the public ones and I'm a nerd who likes to build things.
      image: registry.alisceon.com/alisceondotcom/hugo
      credentials:
          username: ${{ vars.REGISTRY_USER }}
          password: ${{ secrets.REGISTRY_TOKEN }}
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          submodules: recursive
          fetch-depth: 0
          path: '.'
          token: ${{ secrets.ACCESS_TOKEN }}
          github-server-url: https://git.alisceon.com

      - name: Install python deps
        run: |
          python3.11 -m pip install -r tools/requirements.txt          

      - name: List changed files
        run: |
          git diff --merge-base --name-only origin/main \
          | tee .changed-files          

      - name: Pre-build tests
        run: |
          python3.11 tools/grimagoire/grimagoire.py \
            --config tools/grimagoire/config.yml \
            --input .changed-files exifnt \
            --test
          python3.11 tools/fixaform/fixaform.py \
            --config tools/fixaform/config.yml \
            --input .changed-files          

      - name: Build with Hugo
        env:
          HUGO_ENVIRONMENT: production
          HUGO_ENV: production
        run: |
          mkdir ./public
          /usr/bin/hugo \
            --gc \
            --minify \
            --destination ./public \
            --baseURL https://alisceon.com          

      - name: Post-build tests
        #It isn't what I'd call AMAZING practice to store PGP keys even as 
        #secrets, but it is fine with my current threat model and layout. 
        #If someone has gotten access to my repo, I'm in for far worse of 
        #a time than any amount of illicit signing could impart.
        run: |
          echo "${{ vars.GPG_KEY_PUB }}" > gpg-key.pub
          echo "${{ secrets.GPG_KEY }}" > gpg-key.priv
          python3.11 tools/siggy/siggy.py ./public -sv 
          rm gpg-key.pub gpg-key.priv          

If the last commit on the branch that the pull request is open for clears all of those tests, then I can click the big nice merge button, and the feature branch is merged into the main one. When this happens, another workflow triggers: the deployment workflow. This workflow first creates a job that builds the site and then uploads an artifact to Gitea. The artifact that this job produces is a simple zip of the entire static site, which is nice to have lying around if ever needed. Another job then triggers, which pulls that artifact right back down and extracts it. Then it runs siggy to sign all the files that need signing right before rsyncing those files to the deploy directory.

name: Deploy Hugo site to Container
on:
  push:
    branches:
      - main
  workflow_dispatch:
defaults:
  run:
    shell: bash

jobs:
  build:
    runs-on: ubuntu-latest
    container: 
      image: registry.alisceon.com/alisceondotcom/hugo
      credentials:
          username: ${{ vars.REGISTRY_USER }}
          password: ${{ secrets.REGISTRY_TOKEN }}
        
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          submodules: recursive
          fetch-depth: 0
          path: '.'
          token: ${{ secrets.ACCESS_TOKEN }}
          github-server-url: https://git.alisceon.com

      - name: Build with Hugo
        env:
          HUGO_ENVIRONMENT: production
          HUGO_ENV: production
        run: |
          mkdir ./public
          /usr/bin/hugo \
            --gc \
            --minify \
            --destination ./public \
            --baseURL https://alisceon.com          

      - name: Upload site artifact
        uses: actions/upload-artifact@v3
        with:
          name: hugo-snapshot
          path: ./public
          retention-days: 90

  deploy:
    needs: build
    runs-on: hugo-deploy
    env:
      HUGO_VERSION: 0.124.0
    steps:
      - name: Add dependencies
        run: |
          apk add --update npm rsync python3 py3-pip gnupg          

      - name: Checkout
        uses: actions/checkout@v4
        with:
          submodules: recursive
          fetch-depth: 0
          path: './source'
          token: ${{ secrets.ACCESS_TOKEN }}
          github-server-url: https://git.alisceon.com

      - name: Get tools
        run: |
          mv ./source/tools ./tools          

      - name: Install python deps
        run: |
          pip install --break-system-packages \
            -r ./tools/requirements.txt          

      - name: Get artifact
        uses: actions/download-artifact@v3
        with:
          name: hugo-snapshot
          path: ./staging

      - name: Sign off
        run: |
          echo "${{ vars.GPG_KEY_PUB }}" > gpg-key.pub
          echo "${{ secrets.GPG_KEY }}" > gpg-key.priv
          python3 tools/siggy/siggy.py ./staging -sv 
          rm gpg-key.pub gpg-key.priv          

      - name: rsync out
        run: |
          rsync -rv --delete-before ./staging/* /deploy
          chmod -R 755 /deploy/*          

But wait, I hear myself ask, how does that actually end up where Caddy can get the files? See, here is where we get a little bit tacky. The last job runs on a runner with the label “hugo-deploy”, of which there is only one, and that runner lives right next to the Caddy container. Whereas my ubuntu-latest runner runs in a docker-in-docker setup that is very well segmented from the host, this one ain’t. It only ever gets this exact job to run, and nothing else, which limits its exposure to potential external hijinks. Unless alpine/python/npm/rsync/etc gets hijacked (at which point my little cloud image is the least of my worries) or my Gitea is compromised (which means my entire server is too), this is going to be predictable in its execution. As with earlier, this is a compromise that I am willing to make for this deployment that I would not necessarily make in other cases.

The sausage has now been made! 🔗

So as you are reading this, these little words have flown through my fingers and into this markdown file. That markdown file has then been hammered by various scripts to get it into a proper shape to be pushed up to a remote repository, where it was yet again tested. As it cleared those tests it was transformed into a browser-friendly format and deployed to a webserver that is exposed to you and the rest of the public. What a journey! Writing it out, it sounds needlessly complex, but as I write it out I am also noticing how little work I actually have to do. Once set up, systems like these will save you so much time that you get a return on your investment very quickly. You can get that return far quicker by being less stubborn than me and looking for tools in the wild that do what you need done with far less invested effort. I would likely go that route if I were paid for my time but, alas, I am not, and this is merely for my own enjoyment.