blog: New blogpost! "idea: the regex handbook"
Signed-off-by: SindreKjelsrud <sindre@kjelsrud.dev>
parent 9f5c345f00, commit 562fda744b
1 changed file with 153 additions and 0 deletions
src/content/blog/idea-the-regex-handbook.md
---
title: "idea: the regex handbook"
pubDate: "Dec 14 2025"
description: ""
draft: false
---

After a lot of recent regex problems at work, I feel like I've finally gotten the hang of it, or at least of the C# and ECMAScript flavours. One thing a colleague and I discussed, though, was the problem of all the edge cases you get when dealing with all the different rules and laws.

Say, for example, a new business is going to be registered in Norway. It would need to send in an [Altinn form](https://info.altinn.no/skjemaoversikt/) with all the details. That form should probably restrict its input fields to prevent user errors, and this is where the many different rules for naming a business become a "problem".

The business name, in Norway at least, needs to consist of at least three letters from the Norwegian alphabet, it shouldn't contain the name of a country, county, or municipality, and certain special laws limit the right to use defined terms in the name as well (e.g. "bank", "apotek", "børs"). [Here's a list of the laws](https://lovdata.no/dokument/NL/lov/1985-06-21-79/KAPITTEL_2#%C2%A72-2), as I know there are more scenarios than the ones I've mentioned - as you can also see in [this blogpost](https://enklerestart.no/blogg/hva-skal-selskapet-hete/).

We ended the discussion with the thought of a regex handbook of sorts: a register of all the unique cases, each with a description and the matching regex. It would make it much easier for developers to find the correct regex for their specific regex flavour. As I haven't found anything like this yet, I wanted to write a blogpost sharing the idea and some of the regexes I've come up with or found.

## A quick and simple regex 101 (at least for some of the common flavours)

> _PS: I can't recommend [regex101](https://regex101.com) enough for the trial and error testing that comes with regex._

When building a regex, `^` marks the beginning of a string, while `$` marks the end. For example, `^[a-zA-Z]+$` matches only strings that consist of one or more letters in the "a" to "z" and "A" to "Z" ranges. The `+` quantifier means "one or more" of the preceding character set; without it, the regex would match exactly one character.
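
As a minimal sketch of what the anchors and the `+` quantifier do, here is that pattern in TypeScript (ECMAScript flavour), next to an unanchored version for comparison. The names `asciiLetters`, `containsLetter` and `isAsciiLetters` are made up for this example.

```typescript
// Anchored: the whole string must be one or more ASCII letters.
const asciiLetters = /^[a-zA-Z]+$/;

// Unanchored: it is enough that a letter appears somewhere in the string.
const containsLetter = /[a-zA-Z]+/;

// Hypothetical helper name, used only for this example.
function isAsciiLetters(input: string): boolean {
  return asciiLetters.test(input);
}
```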

But the example above excludes a lot of languages, e.g. Norwegian, which has three additional letters: "Æ", "Ø" and "Å". Here you'd either have to add them to the character set, e.g. `[a-zæøåA-ZÆØÅ]`, or use a predefined character class like a [Unicode character property](https://wikipedia.org/wiki/Unicode_character_property) class, e.g. `\p{L}` (any letter; in ECMAScript this requires the `u` flag).
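
To make the difference concrete, here is a small TypeScript sketch of both options. The variable names are illustrative only, and the `\p{L}` version is built with the `RegExp` constructor simply to make the required `u` flag explicit.

```typescript
// Explicit character set: ASCII letters plus the three extra Norwegian letters.
const norwegianLetters = /^[a-zæøåA-ZÆØÅ]+$/;

// Unicode property class: any letter in any script.
// \p{L} is only recognised when the "u" flag is set.
const anyLetter = new RegExp("^\\p{L}+$", "u");
```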

The latter looks like the easiest and best solution, but depending on the use-case it might not be the right fit. In a Norwegian context, for example, the Spanish letter "ñ" would not be allowed in a business name in [Brønnøysundregisteret](https://brreg.no) due to their rules. An easy way to check this is [navnesok.no](https://navnesok.no), which helps you find out whether a business name is allowed or not.

Still, user errors can happen, as not all users will check this site before submitting their application form. Therefore, a good regex is necessary, and maybe even different ones for different fields, like personal names vs. business names.

This brings up another thing to think about: making sure you include the whole country and its population. In Norway, we've got the Sami - _the indigenous people of Sápmi, a region spanning parts of Norway, Sweden, Finland and Russia_ - whose alphabets we need to allow as well so we don't exclude them. Additionally, in some cases we wouldn't want to keep users from entering their full name, which may include foreign letters.

As you can see, there are a lot of different scenarios to keep in mind, and therefore a lot of different regexes to know of. That's why I hope there will someday be some sort of regex handbook covering all of them.

Here are some additional cases to the ones above, which can be added to that handbook. (_At least after they've been combed through by a professional lol, I'm still not that confident in my regexes._)

## The (current) Regex Handbook

### Norwegian phone numbers (_adjustable_)[^1]

> _Can be adjusted to other countries_

**Use-case:** When you want a strict input field that only allows Norwegian phone numbers (or, after adjusting, those of another country).

**Rules:**

- Input can start with either 0047 or +47 (the country code is not required)
- Phone numbers need to start with either 4 or 9
- A total of 8 digits, with nothing after those eight
- Spaces are not allowed in the phone number

**Regex:** `^(0047|\+47)?[49]\d{7}$` or, as a looser variant that accepts any first digit from 1 to 9 and an optional space between the digits, `^(0047|\+47)?[1-9]( ?\d){7}$`. Note that the `+` in `+47` has to be escaped, and that `|` inside a character set is a literal pipe, so the set is written `[49]` and not `[4|9]`.
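
A minimal TypeScript sketch of the rules above, with the `+` escaped and the first-digit set written as `[49]`. The names `strictPhone`, `loosePhone` and `toNationalNumber` are hypothetical; the helper only illustrates how an accepted input could be reduced to its eight national digits.

```typescript
// Strict variant: optional 0047 or +47 prefix, then 8 digits starting with 4 or 9.
const strictPhone = /^(0047|\+47)?[49]\d{7}$/;

// Looser variant: any first digit 1-9 and an optional single space before each later digit.
const loosePhone = /^(0047|\+47)?[1-9]( ?\d){7}$/;

// Hypothetical helper: drops the country code and spaces from an accepted number.
function toNationalNumber(input: string): string | null {
  if (!loosePhone.test(input)) {
    return null;
  }
  return input.replace(/^(0047|\+47)/, "").replace(/ /g, "");
}
```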

---

### 70+ European (& some African) characters[^2]

**Use-case:** When you want to allow more characters than the English alphabet without allowing the whole Unicode set.
E.g. let's say you have users from Europe, so you need your regex to accept European languages such as German, Italian, Spanish, Portuguese, Danish, Swedish, Irish, Albanian and more.

In short, we want to allow most European characters.

**Characters:** `ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿıŒœŠšŸŽž`

**Regex:** `^[a-zA-Z\u00c0-\u017e]+$`
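
As a hedged sketch, this is how the range could be used in TypeScript. One thing to be aware of is that `\u00c0-\u017e` also covers the two signs `×` (U+00D7) and `÷` (U+00F7), so the second pattern shows one possible way to carve them out with a negative lookahead. Both variable names are made up for this example.

```typescript
// Letters a-z/A-Z plus the code point range U+00C0..U+017E
// (the Latin-1 Supplement letters and Latin Extended-A).
const europeanLetters = /^[a-zA-Z\u00c0-\u017e]+$/;

// Same idea, but with the two non-letter signs in that range excluded:
// U+00D7 (multiplication sign) and U+00F7 (division sign).
const europeanLettersOnly = /^(?:(?![\u00d7\u00f7])[a-zA-Z\u00c0-\u017e])+$/;
```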

---

### The Sami languages within Norway[^3]

**Use-case:** As mentioned above, Norway has the indigenous _Sami_ people, who have around ten languages - of which three are the most dominant in Norway. They also use some of the special characters of the country they live in, e.g. "Æ" in Norway and "Ä" in Sweden, so those characters are added to the regex for that reason.

In short, we want to allow the three dominant Sami languages: Southern, Lule and Northern Sami.

- **Characters:**
  - Southern Sami: `Ïï, Öö, Åå, Ææ`
  - Lule Sami: `Áá, Ŋŋ, Åå, Ææ`
  - Northern Sami: `Áá, Čč, Đđ, Ŋŋ, Šš, Ŧŧ, Žž`
- **Regex:** `^[a-zæøåïöáŋčđšŧžA-ZÆØÅÏÖÁŊČĐŠŦŽ]+$`
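
A minimal TypeScript sketch of that character set in use. The function name `isSamiOrNorwegianName` is hypothetical, and the pattern only covers letters, so names containing spaces or hyphens would need those added to the set.

```typescript
// ASCII letters, the Norwegian æ/ø/å, and the extra letters of
// Southern, Lule and Northern Sami, in lower and upper case.
const samiLetters = /^[a-zæøåïöáŋčđšŧžA-ZÆØÅÏÖÁŊČĐŠŦŽ]+$/;

// Hypothetical helper name, used only for this example.
function isSamiOrNorwegianName(input: string): boolean {
  return samiLetters.test(input);
}
```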

---

### Zipcode in Norway[^1]

**Use-case:** Validating Norwegian zip codes, which are 4 digits long - ranging from `0001` to `9998`. The numbers `0000` and `9999` are not to be used.

- **Regex:** `^(?!0000|9999)\d{4}$`
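
One way to express "any four digits except `0000` and `9999`" is a negative lookahead rather than enumerating digit ranges, which is easy to get wrong (codes like `0010` or `1009` slip through the cracks). The sketch below takes that approach; `isNorwegianZipCode` is a hypothetical name, and the check only covers the format, not whether the code is actually in use.

```typescript
// Four digits, except the two values that are not to be used.
const zipCode = /^(?!0000|9999)\d{4}$/;

// Hypothetical helper name, used only for this example.
function isNorwegianZipCode(input: string): boolean {
  return zipCode.test(input);
}
```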

---

### Social security number in Norway

**Use-case:** Validating Norwegian social security numbers. These consist of a birth date (DDMMYY) followed by a 5-digit personal number. There are two common regexes depending on whether a separator is allowed. The lookaheads after `31` and `30` reject days that don't exist in the given month (e.g. the 31st of April or the 30th of February).

- **Regex (with optional separator):** `^(0[1-9]|[1-2][0-9]|31(?!(?:0[2469]|11))|30(?!02))(0[1-9]|1[0-2])(\d{2})(.?)(\d{5})$`. Note that `.?` accepts any single character as the separator, including a digit; narrow it to e.g. `[ -]?` if only a space or a hyphen should be allowed.
- **Regex (without separator):** `^(0[1-9]|[1-2][0-9]|31(?!(?:0[2469]|11))|30(?!02))(0[1-9]|1[0-2])\d{7}$`
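
The capture groups make it possible to pull the individual parts out of a match. The TypeScript sketch below assumes the separator is narrowed to a space or a hyphen (my assumption, not a rule from any specification), and `NationalIdParts` and `parseNationalId` are hypothetical names. It only checks the shape of the number, nothing more.

```typescript
// Day (with lookaheads rejecting the 31st/30th in months that lack them),
// month, two-digit year, optional separator, five-digit personal number.
// Assumption: the separator is limited to a space or a hyphen instead of ".?".
const nationalId =
  /^(0[1-9]|[1-2][0-9]|31(?!(?:0[2469]|11))|30(?!02))(0[1-9]|1[0-2])(\d{2})([ -]?)(\d{5})$/;

// Hypothetical result type for this example.
interface NationalIdParts {
  day: string;
  month: string;
  year: string;
  personalNumber: string;
}

// Hypothetical helper: returns the captured parts, or null if the shape is wrong.
function parseNationalId(input: string): NationalIdParts | null {
  const match = nationalId.exec(input);
  if (match === null) {
    return null;
  }
  return {
    day: match[1],
    month: match[2],
    year: match[3],
    personalNumber: match[5],
  };
}
```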

---

### Organization number in Norway[^1]

**Use-case:** Validating Norwegian organization numbers. They are 9 digits long and may optionally include spaces or dots as separators between each group of three digits.

- **Regex:** `^[0-9]{3}[\s.]?[0-9]{3}[\s.]?[0-9]{3}$`
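
A small TypeScript sketch of how the pattern could be combined with a normalisation step, so that all accepted spellings end up as the same nine digits. `normalizeOrgNumber` is a hypothetical helper, and the sample values in the note below are placeholders rather than real organization numbers.

```typescript
// Three groups of three digits, each optionally separated by whitespace or a dot.
const orgNumber = /^[0-9]{3}[\s.]?[0-9]{3}[\s.]?[0-9]{3}$/;

// Hypothetical helper: returns the bare nine digits, or null if the shape is wrong.
function normalizeOrgNumber(input: string): string | null {
  if (!orgNumber.test(input)) {
    return null;
  }
  return input.replace(/[\s.]/g, "");
}
```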

---

### H-number in Norway

**Use-case:** Validating Norwegian H-numbers. These are a specific type of national identity number where the birth date is modified by incrementing the third digit by 4, so the month part runs from 41 to 52 instead of 01 to 12.

- **Regex:** `^(0[1-9]|[1-2][0-9]|31(?!(?:4[2469]|51))|30(?!42))(4[1-9]|5[0-2])(\d{2})(\s?)(\d{5})$`
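
Taking the rule above literally (third digit plus 4), the month part of an H-number would run from 41 to 52. The TypeScript sketch below is written under that assumption, with the day lookaheads shifted the same way; `hNumber` and `hNumberMonth` are hypothetical names, and only the shape is checked.

```typescript
// Day, month shifted by 40 (41-52), two-digit year, optional whitespace,
// five-digit personal number. The lookaheads reject the 31st/30th in
// shifted months that lack them (e.g. 44 = April, 42 = February).
const hNumber =
  /^(0[1-9]|[1-2][0-9]|31(?!(?:4[2469]|51))|30(?!42))(4[1-9]|5[0-2])(\d{2})(\s?)(\d{5})$/;

// Hypothetical helper: returns the real month (1-12) encoded in an H-number.
function hNumberMonth(input: string): number | null {
  const match = hNumber.exec(input);
  if (match === null) {
    return null;
  }
  return Number(match[2]) - 40;
}
```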

---

### Bankaccount-number in Norway[^1]

**Use-case:** Validating Norwegian bank account numbers. These are 11 digits long, cannot start with 0, and can be formatted with dots, with spaces, or without separators.

- **Regex (with dots):** `^[1-9]\d{3}\.\d{2}\.\d{5}$` (e.g. `1111.22.33333`)
- **Regex (with spaces):** `^[1-9]\d{3} \d{2} \d{5}$` (e.g. `1111 22 33333`)
- **Regex (no separators):** `^[1-9]\d{10}$` (e.g. `11112233333`)
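
Since the three formats describe the same number, a validator would typically try each pattern and then store one canonical form. A minimal TypeScript sketch of that idea, with hypothetical names throughout; keeping the patterns separate means mixed separators such as `1111.22 33333` are rejected.

```typescript
const accountWithDots = /^[1-9]\d{3}\.\d{2}\.\d{5}$/;
const accountWithSpaces = /^[1-9]\d{3} \d{2} \d{5}$/;
const accountPlain = /^[1-9]\d{10}$/;

// Hypothetical helper: returns the bare 11 digits, or null if no format matches.
function normalizeAccountNumber(input: string): string | null {
  const accepted =
    accountWithDots.test(input) ||
    accountWithSpaces.test(input) ||
    accountPlain.test(input);
  if (!accepted) {
    return null;
  }
  return input.replace(/[. ]/g, "");
}
```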

---

### Registrationsnumber for cars in Norway[^1]

**Use-case:** Validating various types of Norwegian vehicle registration numbers, including personal cars, buses/trucks/motorcycles, and more general plates with flexible formats.

- **Regex (Personal car, no spaces, case-insensitive):** `^[A-Za-z]{2}[1-9]\d{4}$`
- **Regex (Personal car, with space, case-sensitive - uppercase only):** `^[A-Z]{2} [1-9]\d{4}$`
- **Regex (Bus/Truck/MC, no spaces):** `^[A-Z]{2}[1-9]\d{3}$`
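
A short TypeScript sketch of the three plate formats. It assumes the "with space" variant means a single space between the letters and the digits, and it writes the letter set as `[A-Za-z]`, because a comma inside a character set (`[A-Z,a-z]`) is matched literally. The variable names are made up for this example.

```typescript
// Personal car, no space, upper or lower case letters.
const carPlate = /^[A-Za-z]{2}[1-9]\d{4}$/;

// Personal car, uppercase only, with a single space after the letters (assumed layout).
const carPlateSpaced = /^[A-Z]{2} [1-9]\d{4}$/;

// Bus/truck/MC: two uppercase letters and four digits.
const heavyOrMcPlate = /^[A-Z]{2}[1-9]\d{3}$/;
```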

---

### D-number in Norway

**Use-case:** Validating Norwegian D-numbers. These are a type of national identity number for foreign citizens, where the first digit of the birth date is incremented by 4, so the day part runs from 41 to 71 instead of 01 to 31 (_basically the same as H-numbers, just using the first instead of the third digit_).

- **Regex:** `^(4[1-9]|[56][0-9]|71(?!(?:0[2469]|11))|70(?!02))(0[1-9]|1[0-2])(\d{2})(\s?)(\d{5})$`
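
Taking the rule above literally (first digit plus 4), the day part of a D-number would run from 41 to 71. The TypeScript sketch below is written under that assumption; `dNumber` and `dNumberDay` are hypothetical names, and only the shape is checked.

```typescript
// Day shifted by 40 (41-71), month, two-digit year, optional whitespace,
// five-digit personal number. The lookaheads reject the shifted 31st (71)
// and 30th (70) in months that lack them.
const dNumber =
  /^(4[1-9]|[56][0-9]|71(?!(?:0[2469]|11))|70(?!02))(0[1-9]|1[0-2])(\d{2})(\s?)(\d{5})$/;

// Hypothetical helper: returns the real day of month (1-31) encoded in a D-number.
function dNumberDay(input: string): number | null {
  const match = dNumber.exec(input);
  if (match === null) {
    return null;
  }
  return Number(match[1]) - 40;
}
```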

---

[^1]: [Epinova](https://www.epinova.no) wrote a [good blogpost about Norwegian regex](https://www.epinova.no/folg-med/blogg/2020/regex-huskeliste-for-norske-formater-i-episerver-forms), which I collected some of the cases above from.

[^2]: [How to allow European characters in text fields by using regular expression?](https://port135.com/how-to-allow-european-characters-in-text-fields-by-using-regular-expression/)

[^3]: [Samiske språk | NDLA](https://ndla.no/r/norsk-sf-vg1/samiske-sprak/27bdb6ce15) and [Innføring i de samiske språkene](https://samiskeveivisere.no/innforing-i-de-samiske-sprakene/) helped me with the Sami languages.

<!--
https://info.altinn.no/starte-og-drive/starte/valg-av-navn/#krav-til-navn-p%C3%A5-foretaket
https://enklerestart.no/blogg/hva-skal-selskapet-hete/
https://www.brreg.no/registersok/
https://blog.golimb.com/2023/02/08/norwegian-regex-examples/amp/
-->