Adding Key Phrases to Jekyll Blog Posts - The Offline Edition
Now that OpenAI’s ChatGPT is the hottest ticket in town, it’s about time to call it what it is:
A white guy in an over-priced suit, impersonating every bubble imaginable, continuously spewing utter fascist crap into the world without bothering to hide its blatant biases and presenting everything as ‘The Truth’. I would not want to drink a beer with it if it were a person. I would want it incarcerated indefinitely.
So let us just appreciate the ‘offline goodness’ and ‘limited scope’ that thankfully still exist in the AI world, and add just one more New Year’s resolution: please stop feeding proprietary online AI monstrosities like the one described above. They are owned by insane, self-interested megalomaniacs; do not let the name ‘OpenAI’ fool you…
Strangely enough, those ‘monstrosities’ do have a lot in common with ‘him’ who bought Twitter not so long ago or ‘the one’ who thinks a metaverse represents a true utopia, IMHO.
Key phrases - the offline edition
Get inspired by auto-tagging key phrases
While we cannot look into the OpenAI ChatGPT model, because ‘it is too large’, ‘we wouldn’t understand what we are looking at’ and most of all ‘how would we make money otherwise?’, there are excellent open source alternatives for key phrase extraction out there.
So let me show you something practical.
This post may be interesting for all of you who are blogging using a Static Pages solution and want to have some ‘Key Phrase suggestions’ called ‘tags’ inserted, on-demand, and offline, into the post you are actively editing.
In my case, I am using the Jekyll static site generator hosted on GitHub Pages, and I write my posts in Markdown syntax with a YAML frontmatter section in each ‘post’ file. This frontmatter contains metadata about the post like the ‘title’, ‘date’ and, most importantly, the ‘tags’. This metadata is rendered appropriately when the pages are built and requested by a browser. The Markdown source of a post will look something like this:
---
title: My first blog post
date: 2023-01-01 10:00:00+02:00
header:
  og_image: /assets/images/post-my-first-blog-post-001.png
categories:
  - blog
tags:
  - FirstBlog
  - Post
  - ANiceThing
---
## This is my first blog post
Just writing along is a nice thing to do!
Wouldn’t it be nice to get some inspiration for the tags in your post?
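To make the layout above concrete, here is a minimal sketch of how a script could separate the frontmatter from the post body and read the current tags. It uses only the Python standard library. The function names `split_front_matter` and `read_tags` are made up for this illustration and are not taken from the repository, and the line scan only understands the simple block-style list shown above, not YAML in general.

```python
def split_front_matter(source: str):
    """Return (front_matter, body) for a Jekyll post; front_matter is '' if absent."""
    lines = source.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", source
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            # Everything between the two '---' markers is frontmatter
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return "", source  # unterminated frontmatter: treat it all as body


def read_tags(front_matter: str):
    """Collect the items of a block-style 'tags:' list (a line scan, not a YAML parser)."""
    tags, in_tags = [], False
    for line in front_matter.splitlines():
        stripped = line.strip()
        if stripped == "tags:":
            in_tags = True
        elif in_tags and stripped.startswith("- "):
            tags.append(stripped[2:].strip())
        elif stripped:
            in_tags = False  # any other non-empty line ends the list
    return tags
```

For anything beyond a sketch, a real YAML parser would be the safer choice, since frontmatter can also use inline lists such as `tags: [a, b]`.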
VSCode: the front-end for all your workflows
First off, read the documentation in my GitHub source repository on embedding the Python script add_keyphrases_to_jekyll_blog_post into your personal VSCode workflow. You will need to download the source code and install some Python packages using pip. Be warned, the models take up quite some disk space. Of course, you could also take the code and adjust it to fit your particular workflow.
I use VSCode as an authoring tool for these posts. The cool thing about VSCode is that you can make it exactly fit your specific workflow by using the built-in ‘Tasks’ and ‘Keyboard Shortcut’ functions.
The repository folder add_keyphrases_to_jekyll_blog_post has an nlp.json file that contains the configuration for the main NLP tools. Furthermore, the tasks.json and keybindings.json files allow you to execute the Task (i.e. run the Python script) using a Keyboard Shortcut in VSCode. In this case I have configured the CTRL+SHIFT+F10 key combination.
First copy and paste the text of the keybinding JSON bit into your own ‘User’ key bindings JSON file. Then you will need to have a post active in the VSCode editor when using the key combination.
Taste the NLP magic sauce
The Python script that is launched, add_keyphrases_to_jekyll_blog_post.py, employs state-of-the-art, offline NLP Python packages. KeyBERT does the extraction with its default model, ‘all-MiniLM-L6-v2’, and is fed candidate phrases by a ‘vectorizer’ from KeyphraseVectorizers, which in turn uses a spaCy language model. The models and the number of key phrases to deliver are configured in the nlp.json file.
That all sounds pretty technical, but it just means that in a few lines of Python code, you can feed the text content of your blog post to an NLP model and the output will be a user-defined number of key phrases representing your post.
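Those ‘few lines’ could look roughly like the sketch below. It is not the code from the repository: the function names are hypothetical, the settings are passed as plain arguments instead of being read from nlp.json (whose key names are not reproduced here), and the spaCy pipeline `en_core_web_sm` is an assumption, since the post does not say which spaCy model is used. The KeyBERT and KeyphraseVectorizers imports happen inside the function because loading the models is slow.

```python
def extract_key_phrases(text, top_n=5, model_name="all-MiniLM-L6-v2",
                        spacy_pipeline="en_core_web_sm"):
    """Return up to top_n (phrase, score) pairs for text, best match first."""
    # Third-party packages: pip install keybert keyphrase-vectorizers
    from keybert import KeyBERT
    from keyphrase_vectorizers import KeyphraseCountVectorizer

    # The vectorizer proposes candidate phrases using spaCy part-of-speech tags
    vectorizer = KeyphraseCountVectorizer(spacy_pipeline=spacy_pipeline)
    # KeyBERT ranks the candidates by similarity to the whole document
    kw_model = KeyBERT(model=model_name)
    return kw_model.extract_keywords(text, vectorizer=vectorizer, top_n=top_n)


def phrases_only(keywords):
    """Drop the scores and case-insensitive duplicates, keeping the ranking order."""
    seen, phrases = set(), []
    for phrase, _score in keywords:
        key = phrase.lower()
        if key not in seen:
            seen.add(key)
            phrases.append(phrase)
    return phrases
```

A call such as `phrases_only(extract_key_phrases(post_text, top_n=5))` would then give a plain list of suggestions to use as tags. Both models have to be present on disk before this can run offline.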
Included in the mix is the application of a specific HTML scraping library called ‘Beautiful Soup’, which is excellent for scraping and tidying up textual input.
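As a rough idea of what that tidying step can look like, the sketch below strips the tags from a piece of HTML and collapses the whitespace. It assumes the post has already been rendered to HTML (or contains inline HTML). The function name `html_to_text` is hypothetical, and the standard-library fallback is only there so the example also runs where Beautiful Soup is not installed.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Standard-library fallback: collects the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html: str) -> str:
    """Strip the tags from html and collapse runs of whitespace."""
    try:
        from bs4 import BeautifulSoup  # pip install beautifulsoup4
        text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
    except ImportError:
        collector = _TextCollector()
        collector.feed(html)
        collector.close()
        text = " ".join(collector.chunks)
    # Newlines, tabs and doubled spaces all become a single space
    return re.sub(r"\s+", " ", text).strip()
```

The cleaned string is what you would hand to the key phrase extraction, so that markup does not end up in the suggested tags.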
Yes, you heard it here first: the execution of the input scraping and cleaning AND the key phrase extraction task AND the use of the needed language models can all be done offline, away from prying eyes and for free under MIT/BSD licenses.
NOTE: When a ‘tags’ key already exists in your frontmatter, the script will ask if you want to overwrite these tags. So be sure to check the terminal output in VSCode.
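The overwrite behaviour in the note can be sketched as follows. This is an illustration under assumptions, not the script’s actual code: `replace_tags` is a hypothetical name, it works on the frontmatter text without the `---` markers, and it only handles a block-style `tags:` list.

```python
def replace_tags(front_matter: str, new_tags, overwrite: bool) -> str:
    """Return the frontmatter with its 'tags:' list set to new_tags.

    An existing list is left untouched unless overwrite is True.
    """
    lines = front_matter.splitlines()
    start = next((i for i, line in enumerate(lines) if line.strip() == "tags:"), None)
    block = ["tags:"] + [f"  - {tag}" for tag in new_tags]
    if start is None:
        return "\n".join(lines + block)  # no tags yet: append them
    if not overwrite:
        return front_matter  # keep the existing tags
    end = start + 1
    while end < len(lines) and lines[end].lstrip().startswith("- "):
        end += 1  # skip over the old list items
    return "\n".join(lines[:start] + block + lines[end:])
```

The `overwrite` flag could come from a terminal prompt such as `input("Overwrite existing tags? [y/N] ").strip().lower() == "y"`, which is why the terminal output in VSCode is worth watching.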
References
The necessary reference articles to KeyBERT and KeyphraseVectorizers are provided as an answer on StackOverflow to the question “Gpt 3 keywords extractor”.
Furthermore, I used this great article on Medium to gain insight into the NLP techniques I could use to fit my ‘Blogging workflow’.
Beautiful Soup is the go-to library to scrape the plain text from your markdown/HTML content. This library is also used extensively in one of the greatest free Python programming eLearning efforts, called ‘Python for Everybody’, which I wholeheartedly recommend checking out. Learn Python before ChatGPT gets any better at it!
I want to stress that without these free/freemium sources I wouldn’t have come up with this solution in the first place, so I want to thank the free portion of the internet that is under great threat from proprietary technologies like ChatGPT and wish everyone a great, free, peaceful and educational 2023.