├── 2021 └── 12 │ └── 14 │ └── doing-more-than-one-thing.md ├── 2022 ├── 4 │ └── 29 │ │ └── notes-on-a-lateral-career-move.md └── 6 │ └── 15 │ └── rolling-your-own-crypto-aes.md ├── 2023 └── 3 │ └── 1 │ └── building-a-jank-uart-cable-from-scavenged-parts.md ├── 2024 ├── 5 │ ├── 10 │ │ └── cordic.md │ └── 29 │ │ └── fast-inverse-sqrt.md └── 11 │ ├── 1 │ └── sending-an-ethernet-packet.md │ └── 26 │ └── getting-an-ip-address.md ├── .github └── workflows │ └── rss-feed.yml ├── .gitignore ├── README.md ├── assets ├── AES-Block.png ├── AES-CBC.png ├── AES-Key_Schedule_128-bit_key.png ├── AES-Padding-Blocks.png ├── AES-RotWord.png ├── AES-s-box.png ├── AES-shift-rows.png ├── ArduinoRedGreenLEDs.jpg ├── PinoutArduinoUno.png ├── StackFrames-Improved.png ├── StackFrames-Save.png ├── StackFrames-Wasteful.png ├── Tux.png ├── Uno.png ├── cordic │ ├── 0.png │ ├── 0r.png │ ├── 1.png │ ├── 16.png │ ├── 1r.png │ ├── 2.png │ ├── 2r.png │ ├── 3.png │ ├── 3r.png │ ├── 4.png │ ├── binary-search.png │ ├── cordic.gif │ ├── fixed-point.png │ └── fixed-whole-fractional.png ├── fast-inverse-sqrt │ ├── floats-as-ints.png │ ├── infinity.png │ ├── log2-vs-ints.png │ ├── log2.png │ ├── nan.png │ ├── normal.png │ └── subnormal.png ├── high-level-state-diagram.png ├── serial │ ├── arduino-duemilanove.jpg │ ├── cable-complete.jpg │ ├── cable-cut.jpg │ ├── cable-tinned.jpg │ ├── chip-removed.jpg │ ├── chip-soldered.jpg │ ├── ft232rl-pinout.png │ ├── ft232rl-terminal.png │ ├── smd-breakouts.jpg │ ├── usb-charger.jpg │ ├── usb-soldered.jpg │ ├── with-arduino.jpg │ └── with-headers.jpg ├── tux.cbc.png ├── tux.enc.png ├── tux.gif └── w5100-project │ ├── architecture.png │ ├── bodge-wires.jpg │ ├── connected-interfaces-dhcp.png │ ├── dhcp-state-machine.png │ ├── first-actual-packet.png │ ├── first-unsuccessful-packet.png │ ├── shield.jpg │ └── with-saleae.jpg ├── feed-builder ├── index.ts ├── package-lock.json ├── package.json └── tsconfig.json ├── feed.xml └── generate-feed.sh 
/.github/workflows/rss-feed.yml: -------------------------------------------------------------------------------- 1 | name: Update RSS feed 2 | on: 3 | push: 4 | branches: 5 | - 'main' 6 | jobs: 7 | rss: 8 | name: Update RSS feed 9 | if: ${{ !contains(github.event.head_commit.message, '#update-rss-feed') }} 10 | runs-on: ubuntu-latest 11 | steps: 12 | - name: checkout 13 | uses: actions/checkout@v2 14 | 15 | - name: node 16 | uses: actions/setup-node@v1 17 | with: 18 | node-version: 16.x 19 | 20 | - run: | 21 | cd feed-builder 22 | npm i 23 | npm i -g ts-node 24 | - run: sh ./generate-feed.sh 25 | - run: | 26 | git config user.name "GitHub Actions Bot" 27 | git config user.email "<>" 28 | - run: | 29 | git add feed.xml 30 | git commit -m "#update-rss-feed" 31 | git push origin main -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | node_modules 2 | in-progress -------------------------------------------------------------------------------- /2022/4/29/notes-on-a-lateral-career-move.md: -------------------------------------------------------------------------------- 1 | # Notes on a lateral career move 2 | 3 | I changed jobs recently. That's not all that uncommon for devs - in fact it's accepted wisdom that your best course of action is to switch every 2 years or so in order to maximise your salary. That may or may not be true, but it's not really the route I've taken. What was perhaps a little more unusual about my change was that I made a somewhat lateral move between fields of software engineering, from mostly web-based backend/services to embedded firmware. What I experienced during the process may be of interest to others who are thinking about overcoming the inertia of making a change, or who just want to see the good, the bad, and the ugly of the job search and interviewing merry-go-round. 
4 | 5 | It's a bit of a winding story, but covers fun topics like: 6 | 7 | - My disdain of recruiters 8 | - Elitism and gatekeepers 9 | - Understanding what is *actually* important when making tough choices 10 | - Looking for the signs that something is on fire 11 | 12 | ## Agency work 13 | 14 | I'll start by briefly addressing why I wanted to make a change. I haven't really job hopped in my career. I worked 2 years at a startup straight out of university building a 2D JS based game engine for a company that builds gamified medical device trainings/simulations. After that I joined an agency as a general purpose software engineer, consulting with many different clients on all kinds of projects (and levels of experience with the process). I did that for 5 years, and would honestly recommend it to new developers as it builds your skills in every area (including the all important client facing soft skills). There I worked on everything from electronic patient record / healthcare interaction prototyping for one of the world's largest tech companies, to interactive research applications for academic institutions, a private futures trading platform spanning multiple exchanges, to an IoT platform doing everything from the PCB layout to writing the backend data ingest pipelines, plus a whole lot more that I can't possibly cover here. 15 | 16 | Agency work is fantastic for the breadth of experience you can attain, but it's also a double edged sword; you have no true ownership over the projects you work on. With some clients, you'll need to guide them through every step - handling the project management side of things and presenting decision moments to them in an understandable way. With others, they'll know (or think they know) what they want, and refuse any advice or suggestion. This can be frustrating when you know that their insistence will create a worse product, experience, or will simply have a negative effect on the deadline. 
After a few years of doing this, a lot of devs will find these issues too frustrating, and seek out a position where they can work on something long-term, and with real (shared) ownership over the product. This was a large part of my reason for leaving, but it wasn't the only one. 17 | 18 | ## Low level passion 19 | 20 | I am one of those programmers you hear about that just loves programming, and I pretty much always have been (or at least, since I saw the 1995 classic *Hackers* and was captivated). I am, at some level of my consciousness, always thinking about code. I don't claim that to be healthy, or something developers should aspire to in order to succeed - but it is how it is for me. I taught myself most of what I know, spending countless hours in the local library to make use of their two computers (probably to the dismay of the adults wanting to actually get things done). 21 | 22 | I am always reading, watching, and exploring new spaces in development. The process of understanding something and building it from scratch has become a bit of an obsession. At the agency the lingua franca was JavaScript (and eventually TypeScript), but we could and would use other languages when it made sense. Because of this, I developed a proficiency in the language, and would take concepts I had read about or seen in other domains and implement them in JS. 23 | 24 | At some point, I developed a fascination with the lower levels of the stack. I got interested in building virtual bytecode machines, and as anyone who has gone down that road will know, ended up learning about parsers, compilers, assemblers, and the like. I was blogging a lot at the time, not for fame or fortune, but because I wanted to share the things I'm fascinated by with others. After a while I decided to start a YouTube channel in order to communicate low level concepts to JS developers called [Low Level JavaScript](https://youtube.com/lowleveljavascript). 
I've made videos on that channel about virtual machines, assemblers, parsers, emulators, digital logic, writing userspace USB drivers, speaking hardware protocols like SPI, and a lot more. No hard sell or anything, but you should really check it out 😉 Over the years of making the channel, I've delved deep into a whole bunch of areas, including computer architecture, and kindled a love for microcontrollers and embedded software. 25 | 26 | At some point, I found myself losing interest in the kinds of projects we were picking up at work. I had been fortunate enough to have had first pick when the embedded projects came around, but they were relatively few and far between. Not to mention I was really the only one in the company who was so inclined - and I know that you level up best when you're around other people who know a lot more than you do in a particular subject area (*"if you're the smartest person in the room, then you're in the wrong room"* and all that). So I made a decision to leave, and look for something specifically in embedded firmware development. 27 | 28 | ## Stress, Anxiety, and Imposter Syndrome 29 | 30 | When I handed in my notice, I didn't have another job lined up. My daughter had been born about a year before, and as I don't need to tell any new parents out there, it takes a lot of your energy reserves! Between family life, a pretty demanding job, and knowing that I would have to focus some time on interview question preparation, I simply didn't have the mental capacity to search for, apply to, and interview for jobs alongside my regular work. 31 | 32 | Being in the position to even make a choice like this is a privilege, and I understand that not everyone can. Still, this was quite a barrier to overcome. My girlfriend and I talked a lot about the risks involved, the financial implications, and landed on a timeline and a plan for how to proceed. 
I had 6 weeks from the first day of freedom to focus purely on embedded positions, after which I would begin to include positions that were essentially what I was already doing. Some of my friends and former colleagues reach out from time to time to ask if I feel like jumping ship, so I figured that I'd be able to land on my feet if the embedded search fell apart. 33 | 34 | I had a month of notice period to fulfill (fairly typical in Europe) - winding down my involvement in projects, writing documentation, etc - and having made the step kind of gave me a renewed energy, and a sense of purpose. Although I had actually begun reaching out into my network and even had a couple of appointments arranged, I started to develop a little anxiety, questioning whether I had made the right choice. *Perhaps you're just not meant to fully enjoy your work* I thought (it is work after all!). Maybe I'd made a mistake. 35 | 36 | On top of that, for the first time in my career, I started to feel some imposter syndrome. Who was I, someone with no formal education in the field, to think that I could work in the embedded space? Embedded, the field that almost all programmers hold in high regard because of the need to adhere to the ancient principles of actually worrying about how much memory and how many clock cycles you have at your disposal. Maybe I'd completely overestimated the skills I'd been teaching myself over the last few years - and anyone actually working in the industry would laugh me off as an amateur hour devboard cosplayer! 37 | 38 | But having already jumped off the deep end, I tried not to dwell on it. 39 | 40 | ## Treating the job search like work 41 | 42 | I'm writing this article in 2022, so for any readers in a hopeful future, this was during corona time. I'd spent almost 2 full years working from home, and I had a rhythm. 
I knew that I needed to keep that rhythm if I was going to have a real chance of finding something within the allotted timeframe, so I continued to wake up as usual, get ready for the day, and head up to the home office, which I affectionately call my "lab", but my girlfriend calls "de zolder" (the attic). I split my day between: 43 | 44 | - **Searching**: Finding interesting positions and queueing them for later 45 | - **Applying**: Writing cover letters, customising CVs, and filling in applications 46 | - **Interview Prep**: Reading and practicing data structures and algorithms questions 47 | - **Projects**: Working on embedded projects that I could talk about and showcase in interviews 48 | 49 | The first two felt pretty obvious. I found it useful to split the process as it allowed me to keep my mind "in gear" and reduced context switches. The interview prep part was something I was kind of dreading - knowing that companies famously suck at hiring, and as such, ask completely obtuse puzzle questions, expecting you to have spent time *grinding leetcode* (shudder). It's not even that I don't like this kind of stuff, or that I think it's useless - I was most worried about not having prepped something beforehand being my downfall, and the interviewer not actually wanting to see me "work through the problem". 50 | 51 | The final part - the projects - is something I don't think a lot of people do. I set aside a good chunk of time to really work on stuff that I knew I could talk about with confidence, and would cover key points in the kinds of jobs I was looking at. The main project I was working on during this time was an STM32 based GameBoy emulator in C, reading games and other data from an SD card, and rendering out to an OLED screen, with from-scratch drivers for both. I'd previously written a GB emulator in 2021, and so understood the system pretty well. 
I could also port the general emulator logic pretty easily, but there were problems to solve in the constrained environment like RAM limitations and general clock speed. The different approaches I was implementing to solve these problems served as points of conversation in interviews. 52 | 53 | Whenever I would actually get an interview, I would allot specific time to prepare, doing company research, and trying to figure out how best to present myself in the context of what was mentioned in the job description. 54 | 55 | ## First hit: An (embedded) Agency 56 | 57 | I wanted to move away from agency work, but via a contact of my girlfriend, I ended up interviewing with an agency that specialised in embedded. During the first interview, I got a tour of the office, a showcase of some of the projects they had done, and finally a talk with the engineering manager. It actually went really well, and I felt a professional click. The projects were interesting and diverse, and the whole place seemed filled with smart people. I emphasised that I was an autodidact coming with passion, but perhaps a lack of experience - but I was sure that I could get up to speed quickly with anything they threw at me. This turned out to be a mistake, but I'll get to that in a moment. 58 | 59 | I did get a second interview, which was a week after the first, this time with two senior engineers. It was pretty much downhill from the beginning unfortunately. I live in the Netherlands, and while I can speak Dutch, I have never done so in a professional context (let alone a *technical* context!). I was told by the engineering manager that it was an English speaking company (this is common for tech companies in Europe), but it was of course a bonus that I would be able to talk to everyone in their native tongue over coffee. So when the two engineers introduced themselves in Dutch, and asked which language I preferred, I repeated: English is my preference for work, but I can speak Dutch if needed. 
One of the engineers, who I will call `A`, said "Let's do it in Dutch, and you can always fall back to English words if needed". 60 | 61 | Perhaps that's reasonable. From my point of view, this interview was a hugely important moment where I was determined to make the best possible impression, and sell my skills and value to the company. To all of a sudden have to do that in my second language, without any kind of preparation, completely knocked my confidence. While they explained their backgrounds, I was spending 2x the regular amount of brain power to make sure I *fully* understood every word being said, all while trying to look casual and assured. When it came to my turn, I started speaking, but couldn't find the words fast enough. I was conscious of the struggle, which only made things worse. I stopped, and said that for me English would be better for this meeting, and I saw `A` check a mental checkbox. 62 | 63 | The other engineer, `B`, was open - asking questions and listening to the responses. `A` asked questions as if trying to trap me. The strangest part about the whole thing is that I was never asked any specific technology questions, no algorithms or data structures, nothing concrete. The questions were things like: 64 | 65 | - *"How well do you know C++?"*, 66 | - or *"How quickly can you read and understand a datasheet in order to write a driver?"* 67 | 68 | There are numerous problems with questions like this: 69 | 70 | 1. They are subjective to a fault. If I say I'm really good at C++, what does that mean exactly? Does it mean the same as if another person said exactly the same thing? What is the complexity of the part I'm dealing with when reading the datasheet? 71 | 2. Your cultural background will have a large influence on how you answer these kinds of questions. The Dutch are direct, but I'm from the UK, where we're typically modest about proclaiming how amazing we are. Other countries and cultures are even more reserved in this way. 
It's a genuine challenge to answer in a way that feels honest and comfortable. 72 | 73 | I tried my best to be charming and actually provide answers to the questions I thought were being asked under the surface: 74 | 75 | - *"Can you competently write code?"* 76 | - *"Can you work alone to figure things out?"* 77 | - *"Do you understand timeboxing, estimation, splitting work into chunks etc?"* 78 | 79 | Unfortunately the click of the first interview just wasn't there, and I could feel that I wasn't convincing `A` - and it seemed that `A` was the decision maker. What was also apparent was `A`'s disdain for JavaScript, the web, and engineers who did not study EE or embedded at a masters level. 80 | 81 | I had unfortunately painted myself as someone with no real experience, and the interview did not afford me the opportunity to disprove that. I got a call not too long later explaining that they didn't want to continue. It was upsetting for a day or so, but I tried my best to take it in stride and keep going. I wrote an email to the engineers and the engineering manager asking for slightly more in-depth feedback, and what they thought I could do to improve my chances for the next opportunity. I actually even inserted practical example suggestions I was already thinking of myself to make it as easy and friction free for them as possible. I received radio silence in return, until 5 weeks later, when I got an email from `A` that "it was on his todo list". I never heard anything else. 82 | 83 | The lesson I learned here was that **it's not my job to tell them if I think I am or am not qualified enough**. If I'm not qualified, they will tell me. It's my job to sell myself, my skills, and my passion as best as I can while being completely open and honest about my background and experience when they ask. 
After this experience, I changed the way I wrote my CV and cover letters to not include any downplaying, but rather *highlight* the skills and experience that correspond to the job requirements - even if those are just personal projects. 84 | 85 | Another take away was that sooner or later you will bump into elitists and gatekeepers. They want to maintain their preconceptions and world view, and there isn't too much you can do aside from not handing them additional ammo. 86 | 87 | ## Recruiters 88 | 89 | My opinion of recruiters wasn't particularly high before embarking on this journey, and at this point it has dropped to below zero. I'm not talking about headhunters at specific companies here - they are genuinely vested in finding good people that fit the role. I'm talking about recruiters from firms that get paid to place candidates. They are, by the nature of the dynamic, not out to help you. Their motivation is getting a candidate, who for the least amount of money will pass the probationary period. They may tell you otherwise, but remember - you are not the customer in this arrangement, you're the *product*. 90 | 91 | I spoke to a few different recruiters early on; some on the phone, some over LinkedIn. One thing I found is that if you don't fit an easy mold, they don't even want to bother with you. In one exchange, I explained that I was looking to make a transition, and he told me it was basically impossible, that no one would hire me on anything but the most meager salary, and wouldn't I just like to look at these other backend webdev jobs that he was sure I would fit. When I made it clear that I would like to try the embedded route in spite of all of that, he told me he had a few positions that he would send me the details of. I never heard from him again. 92 | 93 | A similar thing happened when I spoke to a recruiter who ran her own company, specialised in high tech positions including embedded firmware. 
Again she told me it would be a tough pitch, but she would send me the details of a couple of companies and set up some interviews. Complete radio silence after. She was actually recommended by a friend of my girlfriend who told me to mention him by name. 94 | 95 | I applied to a job online that I thought I was a great fit for, complete with custom CV and cover letter. I got a call less than an hour later from a guy who informed me that the site I'd applied through was a recruitment company, and they had to screen me before passing my details along. I spoke with him for over an hour, answering a ton of questions, explaining my background, experience, and fitness for the job. Thankfully, he actually seemed to think I would be a fit too, but needed to arrange things with his manager in order to proceed. He said he'd call me the next day. That day came, and I did receive a call, this time from his manager, who informed me that the job I'd applied for had been filled more than a month before. However, he thought he had some other positions that would fit me well too. Of course, most of those were backend webdev. Needless to say, I was pretty pissed off at having wasted my time both in the application and phone call. I expressed this, stating that I wasn't interested in those positions. 96 | 97 | I did end up talking to a few companies via this recruitment agency (and I considered joining one of them), but working with them was generally terrible. They would call me at times I had explicitly told them I was unavailable. They would arrange in-person, on-site interviews without checking my availability. They would take the salary expectations I had given them and tell the company they set the interview up with a lower number. 98 | 99 | When I told them I wouldn't be able to make the appointments they had booked without checking, they had the gall to act indignant, and tell me this made them look bad. 
I think with younger developers, or perhaps those not as outspoken, this kind of manipulation tactic would have been effective - but I wasn't having any of it. I told them outright how unprofessional and disorganised their company appeared, and that I would just as well drop the whole thing - to which they immediately backpedaled, telling me they would reschedule. 100 | 101 | Your mileage may vary, but I can honestly say that I will **never** waste my time with recruiters in the future. There are far better ways to find companies and talk to them directly, without having someone interject themselves into the middle in order to extract profit. 102 | 103 | ## Talking directly 104 | 105 | One of the resources that worked well for me was LinkedIn's built-in job search. I was able to narrow a position down to a location radius and seniority level, and pretty easily filter out things that weren't interesting. Some jobs even have an "easy apply" mode, where you simply press a button and it sends your CV - though a couple of jobs had additional checkbox style questions you'd have to answer. This was actually how I found and applied for the position I ended up taking. 106 | 107 | But I also used this search method (as well as searching on recruitment websites) to find positions that sounded interesting but didn't explicitly mention the company by name. I found that you could often google part of the job description, and find it posted elsewhere, with the company's name attached. That gives you an opening to go straight to the source, either applying directly on their website, or writing an email with your CV and bespoke cover letter. 108 | 109 | ## Some places are hiring because everything is on fire 110 | 111 | Hiring is often not an internal priority for a company, and they will wait until things are already in dire straits to actually begin the process. I experienced this in the agency I worked for, as well as multiple companies I spoke to. 
112 | 113 | One position sounded incredibly promising. It was at an industrial robotics company, where they built custom automated lab setups for the agriculture and petro-chemical sectors. When I got there, I was surprised to find out they were a pretty small shop - just one tech guy (the founder), a business guy who handled all the operations and sales, plus a few mechanical and electrical folks who would build the hardware. The tech guy, who I'll call `C`, was a genius - no doubt about it. He had designed and built the first robots, wrote all of the control code, programmed the PLCs, and built the client facing desktop software that queued the experiments and logged results. He'd been working on this for near on 20 years, pretty much entirely alone. It's actually an incredible achievement, and I truly admire the dedication and spirit it took. 114 | 115 | In the interview, it was clear that they'd never interviewed a software developer before. They asked me some of the questions that appear on a cursory google search of "interview questions". I don't hold this against them at all actually - hiring is hard, and I think most people are bad at it. I took it as an opportunity to ask a lot of questions about the job itself, the technologies, the methodologies, approach to deadlines, planning - all of the stuff you actually brush up against day to day. I also used the pretty open space to describe my skills in terms of the projects I'd worked on - personal and professional - as well as discussing my YouTube channel. In a kind of hilarious twist, the business guy - let's call him `D` - had never heard of YouTube before, and I found myself explaining the concept of people making video content, posting it online, and how it can, in some cases, be used as a profitable source of income (though I wouldn't say LLJS falls into that category). They also took me to lunch to end the interview, which was very nice actually. 
They made it clear that they really liked me as a candidate, and would take some time between the two of them to talk it over. 116 | 117 | For me however, there was quite a big issue; namely, it wasn't an embedded firmware job at all. Most of the electronics control was done in commercial off the shelf PLCs. The job was really about maintaining the 20 year old Delphi codebase, perhaps with the idea of slowly rewriting parts of it in more modern languages. This software handled the experiment queuing, data ingest, and eventual processing. It also had effectively no documentation, and quite some tech debt (though they didn't call it that). And while both `C` and `D` assured me that there would, in some future, be possibilities to incorporate custom embedded solutions, any initial focus would be on maintenance work, refactoring, and generally servicing active and new clients. 118 | 119 | There was also a second troubling point - `C` was sick. One of the reasons they needed to hire someone right now was because he needed to be away in order to receive treatment. At the point when `D` was telling me this, `C` was in visible discomfort, though not from pain. It seemed that he was almost ashamed that he would have to stop with his life's work for something as petty as illness. He assured me it was not as bad as it sounded, and that he would be around and available to help me through things, and fully back within a few months. I really appreciate that they were open about it - a lot of places wouldn't be. 120 | 121 | I don't want to come off as insensitive here - `C`'s situation brought back some pretty rough memories of my own father's fight with a sudden cancer diagnosis. He was also one of the most self-motivated people I've ever known, and wanted to continue working on building his business until his dying day. That said, while you can never predict this kind of turn of events, the idea of the bus factor is very well known. 
The entire business, active for 2 decades, hinged on a single man. All the knowledge, code, processes, exceptions, workarounds - everything - was in his head, and there was no time left to write any of it down. 122 | 123 | From a personal standpoint, I wanted to help. I could see that, despite the unfortunate circumstances and any chaos that would be thrown my way, this *was* somewhere I could have a big impact. They had made it clear that I would be able to form my own team and develop myself in any direction that I saw fit. From a professional standpoint, taking a step back, it could easily also be a volatile place, where at any moment way too much responsibility and pressure could fall onto my shoulders before I was ready for it. 124 | 125 | They called me less than a week later for a second meeting - a video call - where they both offered me the job, and `C` showed me around the codebase a little. The code wasn't bad by any means; it was pretty modular in its design, and was split into dynamic libraries, which *would* allow for gradually rewriting parts without throwing everything out. I told them my concerns openly (though delicately), mainly highlighting that my ambition was to move into embedded, and this really wasn't it. In the end, I said I needed some time to think it through, and we settled on 2 weeks. 126 | 127 | ## Things come in threes 128 | 129 | During the intervening period, I had some 8 interviews with 5 other companies - 2 of which ended up making me offers. Two of the others were simply not a good fit for either party. 130 | 131 | One was a startup hiring their 8th or 9th employee, and they were looking for absolute dedication to the cause. I work 4 days per week (32 hours), which is common in Holland. One day per week I look after my daughter at home, and it's one of the things I was absolutely not willing to compromise on. The founder of the startup didn't believe that people could work effectively this way. 
While he is welcome to his opinion, I wholeheartedly disagree. I'm certain that I produce better results in 32 hours than I ever did in 40. Aside from that dealbreaker, they also informed me that there would be a series of 5 interviews, with multiple technical challenges, including a "half-day" take home test, and another in-person test if I got through. Honestly, that level seemed so utterly ridiculous to me that I probably would have declined on that basis alone. 132 | 133 | Another interview was with an embedded consultancy - one where the employees are shipped out to companies to work for anywhere from 6 months to several years. The interview went great, and I liked the directness of the owner, but it was just a little too far away travel-wise for me to really consider it. 134 | 135 | The other unaccounted company was a large producer of 3D printers. I spoke to their in house recruiter, who - surprise surprise - tried to push me towards the webdev/application dev jobs they had open rather than the embedded position I was interviewing for. I politely pushed back and made my case - and I actually had the impression she got where I was coming from. She told me she would schedule my next interview with the engineering manager, but he was out with a mild case of corona. I never heard anything from them again. It's such a strange thing when you consider that even with dozens of candidates, sending a pre-written rejection message would take a few seconds to send. I understand why it isn't done, but I really think it's a bad look. 136 | 137 | For the two that I ended up with offers from, one came via the unprofessional recruitment company. This company was also a startup, but with a few more employees, and a more concrete proof of concept. Without saying too much, they wanted someone to come in and write highly secure, smart card based applications for extremely low-level, low-latency control of a critical communication system. 
When I turned up to the interview, they made it clear they had done their research on me, my open source code, the projects on my YouTube channel, and they believed that from a technical stand point, they were sure that I would fit the bill. The interview mostly consisted of them explaining that this project is R&D, and that they needed someone who was ready to dive into a subject area that only a few people worldwide were currently versed in, using tech with little to no open documentation. Whatever information was out there would be hard to find, and under a pile of NDAs. I could tell they wanted to get a closer look at my character and my drive - to know if I was someone who could bash my head against the problem until something finally came loose. 138 | 139 | Honestly, it sounded like a dream job. They were even willing to pay quite a bit higher than the (wrong) number the recruitment agency had given them, plus equity in the company. The founder was clear that he wanted to get the tech to an MVP level, and sell the company to one of the industry giants, who were already very interested in the idea. But it would be something I would be more or less working on alone. I asked him to send me a written offer, outlining all the parameters, and to give me some time to mull it over. 140 | 141 | I was simultaneously speaking with the company I ended up joining - a power electronics firm, most active in the marine market, and already around for more than two decades. The first interview was with the engineering manager and the software lead over a video call, and was mostly introductory. They told me about the company, what kind of products they make, the markets they're active in, and what the job would consist of day-to-day. I told them about my background, passion, projects, and all the rest. 
The lead software engineer asked a few technical questions that were purely to filter out those who really had no idea what they were doing - what kinds of architectures I was familiar with, how I would approach debugging, what kinds of protocols I had experience with. My answers were apparently satisfactory because within a day or so I was scheduled for an in-person meeting with the software lead and another veteran software engineer who had been at the company almost from the beginning. 142 | 143 | They had told me that there would be a kind of coding challenge, but not one I should worry about. "Every one of us did something similar", the software lead told me with a smile. But of course I worried about it. However, I also had a very honest and direct vibe from the conversation, and had it in my head that if I didn't make it through, it would be on me rather than on the arbitrariness of the test. 144 | 145 | Now I won't go into details about what the exact code was (because it's still in use), but after sitting down, a piece of A4 paper was slid over the table to me to assess. They told me that this was a simplified form of a real bug they had encountered in one of their products, and that I wasn't expected to necessarily solve it. They wanted me to read the code, speaking my thought process out loud, and highlighting what could potentially go wrong. I was - surprisingly - quite calm and even excited to start. Aside from it not being the usual medium that I would consume code in, grokking code like this is something I've practiced naturally over the years. I just started walking through it, saying what I saw, and stating assumptions as I built a mental model. They were communicative throughout the process, sometimes giving me information that would tell me whether my assumption was sound, and sometimes simply letting me continue without confirming or denying - but it didn't feel at all like a hostile environment.
146 | 147 | It turns out there were really a few bugs, and I managed to identify two of them. After I was more or less done, they brought out a more detailed version of the real code and explained the real life context. I asked how they thought I had done, and to my surprise and relief, they said it was probably the best anyone had ever done on that particular problem! 148 | 149 | I don't want to pat myself on the back here too much, but this is a tip that people might miss otherwise: *It's fine, and even very useful for you as a candidate, to ask for feedback during the process itself*. In an interview situation, the candidate is missing a lot of information. You might ask a question, only to be told that they're keeping their cards close to the chest - but this shouldn't have any bearing on the interview itself. If it does, that's useful information too, and might help you dodge a bullet where an organisation values secrecy and politics over communication and openness. In my case, it gave me a huge confidence boost that I carried forward into the subsequent meetings I had that day with various departments. 150 | 151 | I left feeling great, and within a few days they had sent me an offer. 152 | 153 | ## Knowing what is important to you is key 154 | 155 | I had to make a choice between 3 offers: The robotics company, the communications startup, and the power electronics marine firm. This was a very lucky position to be in, but quickly also became the source of a new problem: Needing to figure out how to compare and contrast 3 very different organisations; projecting your life into the future and trying to see what fits. In my opinion, the key here is to understand what is the most important to **you**. 156 | 157 | For some people, it's as simple as money. I had a very high offer from the communications company, albeit one with no pension. The robotics company had also all but offered to pay me whatever I'd asked for (within the bounds of reason). 
158 | 159 | I realised that, for me, four factors were going to be paramount: 160 | 161 | - Travel time / ability to work partially remote 162 | - A real work/life balance 163 | - Projects where I can develop myself and cultivate experience in embedded software 164 | - Having a group of smart people around to learn from and share knowledge with 165 | 166 | I just couldn't see the robotics company fulfilling these - there were too many sticking points, and I didn't want the pressure of an entire business to fall on my shoulders. And while the communications firm would certainly be an interesting challenge, it would be one I'd tackle more or less alone. On top of that, I'd be working with technology and building skills that were not all that transferable. At the power electronics company, I saw not just a cast of talented firmware engineers, but also EEs, control systems engineers, and test engineers. I saw a company culture that values quality at every stage of the design and manufacturing of products - one with the stability and structure of 2 decades of running, where people have families, and work/life balance is encouraged. Perhaps most importantly, it was close to where I live - meaning I'd be able to maximise time with my family instead of spending it traveling. 167 | 168 | With the right lens, the choice was easy. 169 | 170 | ## The happy end 171 | 172 | Making a career move can be a daunting process. I had experienced push back throughout the process by people who wanted to place me in a box based on perceptions and judgments made in mere minutes. I learned not to sabotage myself by offering the idea that I might not be qualified. But principally, I learned that concretely identifying the things that are the most important to you, and using them as a guiding light, from beginning to end, will ultimately lead you to make the right choices when the moment comes around. 173 | 174 | Oh, and that I ***really*** dislike recruiters. 175 | 176 |
177 | 178 | *Many thanks to Nate, Void, and Greg for helping me make this document coherent (and reining in my over-zealous use of hyphens)* -------------------------------------------------------------------------------- /2022/6/15/rolling-your-own-crypto-aes.md: -------------------------------------------------------------------------------- 1 | # Rolling your own crypto: Everything you need to build AES from scratch (and then never use it for anything of consequence) 2 | 3 | You often hear the phrase **"Don't roll your own crypto"**. I think this sentence is missing an important qualifier: **"...and then use it for anything of consequence"**. If you are building a product or service, or are trying to communicate privately, then you should absolutely pick a vetted, open source, off-the-shelf implementation and use it. If, however, your goal is to learn, then there is honestly no better way than simply hacking away on your own code! 4 | 5 | Before we get into it, just a quick word about why the phrase is so often touted: Cryptography is hard to get right for a number of reasons. First, there is the mathematical side of things, where a slip-up can turn something that takes the *lifetime of the universe* to break into something broken in *minutes* by someone with a bit of compute power. Lesser known perhaps, but still equally serious, is the issue of side-channels. The code you write can be completely correct, but still leak secrets through cache timing attacks, or even measured fluctuations in power usage as the algorithms are running. These are not academic attacks either - they are possible with off-the-shelf hardware and enough know-how to pull them off. The folks who write the industrial-scale crypto libraries are well aware of both of these aspects, and trust me: it's better to just leave it to them when it really matters. They're a pretty smart bunch.
6 | 7 | ## Do~~n't~~ try this at home 8 | 9 | With the warnings out of the way, let's talk about why you **would** want to build your own crypto. One reason is [just for fun](https://justforfunnoreally.dev/)! Another might be that you actually want to *become* one of those people who work on the industrial-scale crypto. Finally, you might be interested from the red-team perspective; learning so that you can try to attack the poor souls who write the insecure code. 10 | 11 | I'm by no means an expert, but I recently got interested in peering behind the curtain while reading Jean-Philippe Aumasson's book [Serious Cryptography](https://nostarch.com/seriouscrypto). The chapter on AES gave a great overview of the algorithm, as well as its *modes of operation*, but some of the details were still a little fuzzy. This post is my attempt to explain some of the parts that were only clear to me after reading the [spec](https://nvlpubs.nist.gov/nistpubs/fips/nist.fips.197.pdf), trawling through code, and spelunking through an underground network of Wikipedia rabbit holes. 12 | 13 | ## Just give me the code! 14 | 15 | There is a full, [open source implementation in C](https://github.com/francisrstokes/AES-C) that accompanies this article. I strongly recommend reading the source as a supplement to the article, as it necessarily explores all of the ideas in their full detail. 16 | 17 | ## AES 18 | 19 | AES, or the "Advanced Encryption Standard", is an extremely widely used symmetric block *cipher*. Symmetric here refers to the idea that both the one encrypting and the one decrypting use the same key. Block refers to the way in which the stuff you're encrypting (the *plaintext*) is turned into the *ciphertext* (the random-looking but reversible sequence of bytes that is the result of the encryption). Cipher refers to any algorithm for encryption. A block cipher operates on multiple bytes of plaintext at the same time, arranged into a 2D block. 20 | 21 |
22 | 23 | Each block is made up of 16 bytes (128 bits), and is arranged in column-major order. The big idea with AES is that this block is scrambled and mutated in a way that is completely reversible, driven by the *secret key*. The secret key is simply a sequence of bits that should only be known to the sender and the receiver. The choice of key must be as close to truly random as possible. 24 | 25 | A block cipher can be contrasted with something like a *stream* cipher, where each byte of the plaintext is encrypted by itself using a *key stream*. 26 | 27 | When the details and mathematical abstractions have had time to sink in, AES may actually appear remarkably simple, but do not be deceived; this simplicity is the result of careful engineering and pragmatism. For me personally, it was humbling to realise that my being able to understand and implement AES was not a reflection on me or my own skills, but rather the ingeniousness of the designers. 28 | 29 | It is my aim that you come away from this article having a very good idea of how to implement AES yourself. We're going to look at the algorithm in-depth, from beginning to end, answering questions like: 30 | 31 | - What are the operations and transformations involved? 32 | - What kind of math underpins AES? 33 | - What is key expansion and how does it work? 34 | - How do you encrypt data that isn't a multiple of the block size? 35 | - How does the same mechanism used in different modes produce results that vary in their security level? 36 | 37 | ## Operations and transformations 38 | 39 | AES transforms each 16-byte block of data by a combination of moving the bytes around, performing reversible mathematical operations on them, and swapping them out for other bytes in a lookup table. These operations are used both to expand the secret key, deriving a further set of keys used throughout the encryption process, and the encryption process itself. The expansion process is called the *key schedule*. 
40 | 41 | The encryption process takes place over a number of *rounds*, and in each round the same set of operations is applied to the block, each time using one of the keys derived from key expansion. 42 | 43 | The concrete steps of the algorithm are as follows for every block of the plaintext input: 44 | 45 | 0. Key schedule 46 | - The key schedule process is used to take the secret key, and derive an extended set of *round keys* 47 | 48 | 1. Addition of the first round key 49 | - The first round key (which is the secret key itself) is "added" to the input. This is not regular addition, but rather addition defined for the finite field $GF(2^8)$, which we will expand on in detail 50 | 51 | 2. A series of "rounds", the exact number of which is defined by the length of the secret key. We will talk about 128-bit keys in this article, in which there are 10 rounds. In a round, a series of operations take place: 52 | - Substitute Bytes 53 | - Shift Rows 54 | - Mix Columns 55 | - Adding the Round Key 56 | 57 | 3. The final round 58 | - The final round is the same as the previous rounds, except that the *Mix Columns* step is skipped 59 | 60 | In the following sections, we will explore the practical and theoretical mechanics of each of these steps. It's not necessary to cover the operations in sequence to understand AES, so we will start with some of the simpler operations and then move on to some of the more challenging parts. 61 | 62 | ### Substitute Bytes (SubBytes) 63 | 64 | The simplest transformation to understand is *substitute bytes*, or `SubBytes` as it's known in the spec. This operation takes place in each round, as well as in the key scheduling process. In this step, every byte in the block is swapped with a corresponding byte in a lookup table called an *s-box*.
The s-box used during encryption is: 65 | 66 | | |00|01|02|03|04|05|06|07|08|09|0a|0b|0c|0d|0e|0f| 67 | |------|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--| 68 | |**00**|63|7c|77|7b|f2|6b|6f|c5|30|01|67|2b|fe|d7|ab|76| 69 | |**10**|ca|82|c9|7d|fa|59|47|f0|ad|d4|a2|af|9c|a4|72|c0| 70 | |**20**|b7|fd|93|26|36|3f|f7|cc|34|a5|e5|f1|71|d8|31|15| 71 | |**30**|04|c7|23|c3|18|96|05|9a|07|12|80|e2|eb|27|b2|75| 72 | |**40**|09|83|2c|1a|1b|6e|5a|a0|52|3b|d6|b3|29|e3|2f|84| 73 | |**50**|53|d1|00|ed|20|fc|b1|5b|6a|cb|be|39|4a|4c|58|cf| 74 | |**60**|d0|ef|aa|fb|43|4d|33|85|45|f9|02|7f|50|3c|9f|a8| 75 | |**70**|51|a3|40|8f|92|9d|38|f5|bc|b6|da|21|10|ff|f3|d2| 76 | |**80**|cd|0c|13|ec|5f|97|44|17|c4|a7|7e|3d|64|5d|19|73| 77 | |**90**|60|81|4f|dc|22|2a|90|88|46|ee|b8|14|de|5e|0b|db| 78 | |**a0**|e0|32|3a|0a|49|06|24|5c|c2|d3|ac|62|91|95|e4|79| 79 | |**b0**|e7|c8|37|6d|8d|d5|4e|a9|6c|56|f4|ea|65|7a|ae|08| 80 | |**c0**|ba|78|25|2e|1c|a6|b4|c6|e8|dd|74|1f|4b|bd|8b|8a| 81 | |**d0**|70|3e|b5|66|48|03|f6|0e|61|35|57|b9|86|c1|1d|9e| 82 | |**e0**|e1|f8|98|11|69|d9|8e|94|9b|1e|87|e9|ce|55|28|df| 83 | |**f0**|8c|a1|89|0d|bf|e6|42|68|41|99|2d|0f|b0|54|bb|16| 84 | 85 | Every 8-bit value is mapped to a different 8-bit value. In order to substitute a byte, the value is used as an index into the s-box table. 86 | 87 |
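As a rough illustration of the lookup described above, here is a minimal sketch of `SubBytes` in C. The names `SBOX` and `SubBytes` are my own choices for this example and may not match the accompanying repository; the table values are copied from the table above.

```C
#include <stdint.h>
#include <stddef.h>

// The encryption s-box, row by row, exactly as in the table above.
static const uint8_t SBOX[256] = {
  0x63,0x7c,0x77,0x7b,0xf2,0x6b,0x6f,0xc5,0x30,0x01,0x67,0x2b,0xfe,0xd7,0xab,0x76,
  0xca,0x82,0xc9,0x7d,0xfa,0x59,0x47,0xf0,0xad,0xd4,0xa2,0xaf,0x9c,0xa4,0x72,0xc0,
  0xb7,0xfd,0x93,0x26,0x36,0x3f,0xf7,0xcc,0x34,0xa5,0xe5,0xf1,0x71,0xd8,0x31,0x15,
  0x04,0xc7,0x23,0xc3,0x18,0x96,0x05,0x9a,0x07,0x12,0x80,0xe2,0xeb,0x27,0xb2,0x75,
  0x09,0x83,0x2c,0x1a,0x1b,0x6e,0x5a,0xa0,0x52,0x3b,0xd6,0xb3,0x29,0xe3,0x2f,0x84,
  0x53,0xd1,0x00,0xed,0x20,0xfc,0xb1,0x5b,0x6a,0xcb,0xbe,0x39,0x4a,0x4c,0x58,0xcf,
  0xd0,0xef,0xaa,0xfb,0x43,0x4d,0x33,0x85,0x45,0xf9,0x02,0x7f,0x50,0x3c,0x9f,0xa8,
  0x51,0xa3,0x40,0x8f,0x92,0x9d,0x38,0xf5,0xbc,0xb6,0xda,0x21,0x10,0xff,0xf3,0xd2,
  0xcd,0x0c,0x13,0xec,0x5f,0x97,0x44,0x17,0xc4,0xa7,0x7e,0x3d,0x64,0x5d,0x19,0x73,
  0x60,0x81,0x4f,0xdc,0x22,0x2a,0x90,0x88,0x46,0xee,0xb8,0x14,0xde,0x5e,0x0b,0xdb,
  0xe0,0x32,0x3a,0x0a,0x49,0x06,0x24,0x5c,0xc2,0xd3,0xac,0x62,0x91,0x95,0xe4,0x79,
  0xe7,0xc8,0x37,0x6d,0x8d,0xd5,0x4e,0xa9,0x6c,0x56,0xf4,0xea,0x65,0x7a,0xae,0x08,
  0xba,0x78,0x25,0x2e,0x1c,0xa6,0xb4,0xc6,0xe8,0xdd,0x74,0x1f,0x4b,0xbd,0x8b,0x8a,
  0x70,0x3e,0xb5,0x66,0x48,0x03,0xf6,0x0e,0x61,0x35,0x57,0xb9,0x86,0xc1,0x1d,0x9e,
  0xe1,0xf8,0x98,0x11,0x69,0xd9,0x8e,0x94,0x9b,0x1e,0x87,0xe9,0xce,0x55,0x28,0xdf,
  0x8c,0xa1,0x89,0x0d,0xbf,0xe6,0x42,0x68,0x41,0x99,0x2d,0x0f,0xb0,0x54,0xbb,0x16
};

// Swap every byte of a 16-byte block for its s-box entry.
// The byte's own value is the index: the high nibble picks the row
// of the table above, the low nibble picks the column.
void SubBytes(uint8_t block[16]) {
  for (size_t i = 0; i < 16; i++) {
    block[i] = SBOX[block[i]];
  }
}
```

For example, the byte `0x53` sits in row `50`, column `03` of the table, so it is replaced with `0xed`.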
88 | 89 | When decrypting, the *inverse* s-box is used. If you think of an s-box as a set of pairs `(index, value)`, then you can construct the inverse table by swapping index and value: `(value, index)`. For completeness, here is the decryption s-box: 90 | 91 | | |00|01|02|03|04|05|06|07|08|09|0a|0b|0c|0d|0e|0f| 92 | |------|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--| 93 | |**00**|52|09|6a|d5|30|36|a5|38|bf|40|a3|9e|81|f3|d7|fb| 94 | |**10**|7c|e3|39|82|9b|2f|ff|87|34|8e|43|44|c4|de|e9|cb| 95 | |**20**|54|7b|94|32|a6|c2|23|3d|ee|4c|95|0b|42|fa|c3|4e| 96 | |**30**|08|2e|a1|66|28|d9|24|b2|76|5b|a2|49|6d|8b|d1|25| 97 | |**40**|72|f8|f6|64|86|68|98|16|d4|a4|5c|cc|5d|65|b6|92| 98 | |**50**|6c|70|48|50|fd|ed|b9|da|5e|15|46|57|a7|8d|9d|84| 99 | |**60**|90|d8|ab|00|8c|bc|d3|0a|f7|e4|58|05|b8|b3|45|06| 100 | |**70**|d0|2c|1e|8f|ca|3f|0f|02|c1|af|bd|03|01|13|8a|6b| 101 | |**80**|3a|91|11|41|4f|67|dc|ea|97|f2|cf|ce|f0|b4|e6|73| 102 | |**90**|96|ac|74|22|e7|ad|35|85|e2|f9|37|e8|1c|75|df|6e| 103 | |**a0**|47|f1|1a|71|1d|29|c5|89|6f|b7|62|0e|aa|18|be|1b| 104 | |**b0**|fc|56|3e|4b|c6|d2|79|20|9a|db|c0|fe|78|cd|5a|f4| 105 | |**c0**|1f|dd|a8|33|88|07|c7|31|b1|12|10|59|27|80|ec|5f| 106 | |**d0**|60|51|7f|a9|19|b5|4a|0d|2d|e5|7a|9f|93|c9|9c|ef| 107 | |**e0**|a0|e0|3b|4d|ae|2a|f5|b0|c8|eb|bb|3c|83|53|99|61| 108 | |**f0**|17|2b|04|7e|ba|77|d6|26|e1|69|14|63|55|21|0c|7d| 109 | 110 | If you're anything like me, then you're probably wondering where these substitution values even come from. Can you use any old values? The answer is yes and no. This s-box was designed by Joan Daemen and Vincent Rijmen - the creators of AES (which is also known as "Rijndael") - to limit its [statistical linearity](https://en.wikipedia.org/wiki/Linear_cryptanalysis), and susceptibility to [differential cryptanalysis](https://en.wikipedia.org/wiki/Differential_cryptanalysis). A different set of (poorly chosen) values could drastically reduce the security of the cipher. 
Interestingly, some implementers *do* choose to use their own values to overcome the possibility that Daemen & Rijmen purposefully inserted a kind of mathematical backdoor, but as far as I know, there is no evidence to suggest this kind of backdoor exists. 111 | 112 | ## Shift Rows 113 | 114 | In the Shift Rows operation, as the name might suggest, some of the rows of the current block are shifted. The first row is not shifted at all (alternatively you could think of it having a shift of 0 bytes). The second row is shifted left by 1 byte, the third row 2 bytes, and the fourth row 3 bytes. The "shift" is actually better described as a *rotation*, because the bytes that are shifted outside of the row come back around into the empty spaces. 115 | 116 |
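To make the rotation concrete, here is a small sketch of how the operation might look in C. It assumes the block is stored as 16 bytes in column-major order (as described earlier in the article), so the byte at a given `row` and `col` lives at index `row + 4 * col`. The function name `ShiftRows` mirrors the spec, but the implementation itself is only illustrative.

```C
#include <stdint.h>
#include <stddef.h>
#include <string.h>

// Rotate row `r` of a column-major 4x4 block left by `r` bytes.
void ShiftRows(uint8_t block[16]) {
  uint8_t temp[16];

  for (size_t col = 0; col < 4; col++) {
    for (size_t row = 0; row < 4; row++) {
      // After rotating left by `row`, the byte landing in column `col`
      // is the one that started `row` columns to the right (wrapping around).
      temp[row + 4 * col] = block[row + 4 * ((col + row) % 4)];
    }
  }

  memcpy(block, temp, sizeof(temp));
}
```

Because row 0 uses a rotation of zero, it passes through unchanged without needing a special case.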
117 | 118 | This operation breaks up the columns, and prevents the algorithm from being based only on columnar encryption, which would weaken it from a statistical analysis point of view. 119 | 120 | ## A diversion into finite fields 121 | 122 | In order to dive further into the operations involved, we need to learn about finite field math. In AES, some addition and multiplication is required to perform the *Add RoundKey* and *Mix Columns* operations, as well as expanding the secret key in the *Key Schedule* process. However, these do not behave as you might be used to. 123 | 124 | As a small note here before we get into it: I am not a mathematician, I am a software engineer. As such, my definitions necessarily do not really even scratch the surface of these topics, nor do they have the rigor you would find from a true mathematician. I know just enough to be dangerous, so apologies if my explanation lacks depth or contains errors. Lastly, while it's not strictly necessary to understand these concepts in order to simply *implement* them, I personally didn't feel accomplished until I did. 125 | 126 | So first things first: What is a finite field? A finite field is a *field* that has a finite number of elements. Your next question, depending on your familiarity with the subject matter, may be: What is a field in this context? A field is a collection of elements (called a set) with a valid definition for addition, subtraction, multiplication, and division. A field can (and often does) have an infinite set of elements. 
The field operations must conform to a set of axioms, known as the *field axioms*: 127 | 128 | - **Associativity** of addition and multiplication: 129 | - $a + (b + c) = (a + b) + c$ 130 | - $a \bullet (b \bullet c) = (a \bullet b) \bullet c$ 131 | - **Commutativity** of addition and multiplication: 132 | - $a + b = b + a$ 133 | - $a \bullet b = b \bullet a$ 134 | - **Identity** for addition: 135 | - There is some $0$ such that $a + 0 = a$ 136 | - **Identity** for multiplication: 137 | - There is some $1$ such that $a \bullet 1 = a$ 138 | - **Inverse** for addition: 139 | - For every element $a$ in the field $F$, there exists an element $-a$, such that $a + (-a) = 0$ 140 | - **Inverse** for multiplication: 141 | - For every element $a$ in the field $F$, there exists an element $a^{-1}$ or $1/a$, such that $a \bullet a^{-1} = 1$ 142 | - **Distributivity** of multiplication over addition: 143 | - $a \bullet (b + c) = (a \bullet b) + (a \bullet c)$ 144 | 145 | Division by zero is typically undefined. One famous field you've probably heard about is the *real numbers*. Another is the *rationals*. 146 | 147 | AES uses a finite field (a Galois field), notated as $GF(2^8)$. What is interesting about this kind of field, and indeed any Galois field over a power of two, is that there is an intrinsic mapping to binary numbers and their related operations - which I'm told computers deal very well with. 148 | 149 | In order to see how $GF(2^8)$ works, let's take a look at how it defines its operations, and how those operations conform to the field axioms. 150 | 151 | ### Addition 152 | 153 | In $GF(2^8)$, addition takes place on polynomial expressions - which map almost directly to binary numbers, since $GF(2^8)$ contains 256 elements, and in binary, a byte can take on 256 unique values. 154 | 155 | Addition is defined as the *exclusive or* logical operation (`^` in most languages). 
This is a little different from some of the conventional fields, but it does meet the criteria set out by the field axioms: 156 | 157 | - Associativity: `a ^ (b ^ c) == (a ^ b) ^ c` 158 | - Commutativity: `a ^ b == b ^ a` 159 | - Identity: `a ^ 0 == a` 160 | - Inverse: `a ^ a == 0` (i.e. every element is its own inverse) 161 | 162 | Note that on its own, addition can't fulfill all of the axioms; We still need multiplication, and we'll get to that shortly. Before we do, though, we need to talk about the representation of the elements of the field. 163 | 164 | Since we're in what is essentially *[abstract algebra](https://en.wikipedia.org/wiki/Abstract_algebra)* land, giving symbols like `2`, `100`, or `42` to the elements in the field doesn't make too much sense (it is however still conventional to call the additive identity `0` and the multiplicative identity `1` 🤷‍♂️). Instead, any given number tends to be represented as a *polynomial expression*. A polynomial expression is one in the form: 165 | 166 | $$ 167 | a_nx^n + a_{n-1}x^{n-1} + \ldots + a_1x + a_0 168 | $$ 169 | 170 | In $GF(2^8)$, any member can be expressed as a polynomial expression with 8 terms, and where the coefficients are either zero or one. This is basically what binary numbers (or any numbers in a particular base) are really - the summation of each of the possible places, multiplied by a constant which is not more than the number of symbols available in the base minus one. 171 | 172 | Concretely, the binary number `10001101` can be expressed equivalently in polynomial form: 173 | 174 | $$ 175 | x^7 + x^3 + x^2 + 1 176 | $$ 177 | 178 | Note that since the only possible coefficients we could have are `0` or `1`, we simply include the term representing a place if it's a one, or not at all if it's a zero. 
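The mapping between bytes and polynomials can be sketched in a few lines of C. The helper names `GF_Add` and `GF_ToPolynomial` are hypothetical, introduced only for this illustration: the first is addition as defined above, and the second writes out the polynomial form of a byte by including a term for every bit that is set.

```C
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

// Addition (and subtraction) in GF(2^8) is just xor.
uint8_t GF_Add(uint8_t a, uint8_t b) {
  return a ^ b;
}

// Write the polynomial form of `value` into `out`, e.g. "x^7 + x^3 + x^2 + 1".
// The longest possible string (for 0xFF) is 41 characters, so 48 bytes is enough.
void GF_ToPolynomial(uint8_t value, char out[48]) {
  size_t len = 0;
  out[0] = '\0';

  // Walk from the highest place to the lowest, emitting a term for each set bit
  for (int bit = 7; bit >= 0; bit--) {
    if (!(value & (1u << bit))) {
      continue;
    }
    const char *separator = (len > 0) ? " + " : "";
    if (bit > 1) {
      len += (size_t)snprintf(out + len, 48 - len, "%sx^%d", separator, bit);
    } else if (bit == 1) {
      len += (size_t)snprintf(out + len, 48 - len, "%sx", separator);
    } else {
      len += (size_t)snprintf(out + len, 48 - len, "%s1", separator);
    }
  }

  // No bits set: this is the additive identity
  if (len == 0) {
    snprintf(out, 48, "0");
  }
}
```

Calling `GF_ToPolynomial(0x8D, buffer)` (`0x8D` being `10001101` in binary) fills the buffer with `x^7 + x^3 + x^2 + 1`, matching the expression above.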
179 | 180 | ### Multiplication 181 | 182 | Mechanically, we can take advantage of the fact that multiplication must adhere to the field axioms, and that polynomial expressions can be manipulated algebraically. 183 | 184 | If we have two numbers: `0x42` and `0xAB` (`0b01000010` and `0b10101011` respectively), then we can represent their product in polynomial form like so: 185 | 186 | $$ 187 | (x^6 + x) \bullet (x^7 + x^5 + x^3 + x + 1) 188 | $$ 189 | 190 | Then we can use the fact that our multiplication must be distributive, along with the fact that multiplying exponent terms in a polynomial equates to adding the powers, to expand out the expression: 191 | 192 | $$ 193 | \begin{aligned} 194 | & (x^6 + x) \bullet (x^7 + x^5 + x^3 + x + 1) \\ 195 | &= (x^{13} + x^{11} + x^9 + x^7 + x^6) + (x^8 + x^6 + x^4 + x^2 + x) \\ 196 | &= x^{13} + x^{11} + x^9 + x^8 + x^7 + x^4 + x^2 + x 197 | \end{aligned} 198 | $$ 199 | 200 | The two $x^6$ terms cancel out, because the coefficients can only be `1` or `0` in this finite field, but would be `2` in this case (`2 % 2` is `0`). 201 | 202 | This has the effect of turning the original problem into what is essentially just another binary number. The only small hitch is that this can clearly create elements that lie outside of the field; in this example, we end up with the binary number `0b10101110010110` (11158), which is far bigger than 255, the largest value a byte can hold. For this reason, this kind of finite field must have an associated *irreducible polynomial*. The true result of multiplication is the expression modulo the irreducible polynomial. In the case of AES, the irreducible polynomial is: 203 | 204 | $$ 205 | x^8 + x^4 + x^3 + x + 1 206 | $$ 207 | 208 | In binary form, this would be `0b100011011` (or `0x11B` in hex). For illustrative purposes, we can compute the modulo by hand using long division.
Since division is just repeated subtraction, and `xor` defined as addition also happens to be the valid definition for subtraction, this ends up being a very simple process. 209 | 210 | ``` 211 | 10101110010110 % 100011011 212 | ^ 100011011 213 | -------------- 214 | 00100011110110 215 | ^ 100011011 216 | ------------ 217 | 000000101110 218 | 219 | = 101110 = 0x2E 220 | ``` 221 | 222 | ### A computational approach 223 | 224 | Long division by hand is, however, not really an approach that maps well to computers. One of the parts that tripped me up when learning about AES and finite fields from more theoretical sources was understanding how to translate the algebra and equations to an algorithm. The answer that I'm going to show here is just one possibility, and not at all the most efficient. In fact, implementing AES in the way this article explains it is not the way you would find it in any *real* system, but it is probably the most direct translation of the key ideas. 225 | 226 | The key to understanding this algorithm is realising that if we only partially simplify the multiplication expression from 227 | 228 | $$ 229 | (x^6 + x) \bullet (x^7 + x^5 + x^3 + x + 1) 230 | $$ 231 | 232 | to 233 | 234 | $$ 235 | x^6 \bullet (x^7 + x^5 + x^3 + x + 1) + x \bullet (x^7 + x^5 + x^3 + x + 1) 236 | $$ 237 | 238 | it becomes quite a bit easier to compute in binary. If we translate this partially simplified expression into the binary number view, we get something like this: 239 | 240 | $$ 241 | 01000000 \bullet 10101011 + 00000010 \bullet 10101011 242 | $$ 243 | 244 | At each place of the binary number, we're multiplying by a power of 2, so we can do some more mathematical rearranging and produce: 245 | 246 | $$ 247 | 10101011000000 + 101010110 248 | $$ 249 | 250 | We can get to this by shifting the left hand side of the multiplication down until 1, and shifting the right side up by the same amount. 
The problem is now one of shifts, xors, and accumulating a total, which is something easily done in any programming language. The following is written in C, but can be easily translated into any other language: 251 | 252 | ```C 253 | uint8_t GF_Mult(uint8_t a, uint8_t b) { 254 | uint8_t result = 0; 255 | uint8_t shiftGreaterThan255 = 0; 256 | 257 | // Loop through each bit in `b` 258 | for (uint8_t i = 0; i < 8; i++) { 259 | // If the LSB is set (i.e. we're not multiplying out by zero for this polynomial term) 260 | // then we xor the result with `a` (i.e. adding the polynomial terms of a) 261 | if (b & 1) { 262 | result ^= a; 263 | } 264 | 265 | // Double `a`, keeping track of whether that causes `a` to "leave" the field. 266 | shiftGreaterThan255 = a & 0x80; 267 | a <<= 1; 268 | 269 | // The next bit we look at in `b` will represent multiplying the terms in `a` 270 | // by the next power of 2, which is why we can achieve the same result by shifting `a` left. 271 | // If `a` left the field, we need to modulo with irreducible polynomial term. 272 | if (shiftGreaterThan255) { 273 | // Note that we use 0x1b instead of 0x11b. If we weren't taking advantage of 274 | // u8 overflow (i.e. by using u16, we would use the "real" term) 275 | a ^= 0x1b; 276 | } 277 | 278 | // Shift `b` down in order to look at the next LSB (worth twice as much in the multiplication) 279 | b >>= 1; 280 | } 281 | 282 | return result; 283 | } 284 | ``` 285 | 286 | This algorithm is a slight modification on what is sometimes known as the [peasants algorithm](https://en.wikipedia.org/wiki/Ancient_Egyptian_multiplication) - which has been used for thousands of years across various cultures. 287 | 288 | ## Mix Columns 289 | 290 | Now that we have an understanding of what it means to *add* and *multiply* in the context of AES, we can take a look at the `MixColumns` operation. 
Again, in another case of thing-named-well, the name `MixColumns` tells us a lot about what this operation does at a high level. In essence, we're taking each of the columns in the current block, and mixing the elements together in such a way as to provide the algorithm with *diffusion*. Diffusion is the idea that changing a single bit in the plaintext causes approximately half of the bits in the ciphertext to change. 291 | 292 | Unlike `ShiftRows`, this is not simply a transposition of the bytes. Instead, each byte in the column is transformed by combining every byte in the column in a unique but reversible way. Mathematically it is best understood as a vector-matrix multiplication, where the column is the vector, and the matrix is a set of coefficients. The multiplication and addition of each of the elements takes place of course in $GF(2^8)$. 293 | 294 | $$ 295 | \begin{bmatrix} 296 | b_0 \\ 297 | b_1 \\ 298 | b_2 \\ 299 | b_3 \\ 300 | \end{bmatrix} = 301 | \begin{bmatrix} 302 | 02 & 03 & 01 & 01 \\ 303 | 01 & 02 & 03 & 01 \\ 304 | 01 & 01 & 02 & 03 \\ 305 | 03 & 01 & 01 & 02 \\ 306 | \end{bmatrix} 307 | \begin{bmatrix} 308 | a_0 \\ 309 | a_1 \\ 310 | a_2 \\ 311 | a_3 \\ 312 | \end{bmatrix} 313 | $$ 314 | 315 | Manually expanding all of the matrix operations in code would yield something like the following in C: 316 | 317 | ```C 318 | void MixColumn(uint8_t column[]) { 319 | uint8_t temp[] = { 0, 0, 0, 0 }; 320 | 321 | temp[0] = GF_Mult(0x02, column[0]) ^ GF_Mult(0x03, column[1]) ^ column[2] ^ column[3]; 322 | temp[1] = column[0] ^ GF_Mult(0x02, column[1]) ^ GF_Mult(0x03, column[2]) ^ column[3]; 323 | temp[2] = column[0] ^ column[1] ^ GF_Mult(0x02, column[2]) ^ GF_Mult(0x03, column[3]); 324 | temp[3] = GF_Mult(0x03, column[0]) ^ column[1] ^ column[2] ^ GF_Mult(0x02, column[3]); 325 | 326 | for (size_t i = 0; i < 4; i++) { 327 | column[i] = temp[i]; 328 | } 329 | } 330 | ``` 331 | 332 | The entire `MixColumns` operation does this for all 4 columns in the block.
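Applying this to a whole block could look something like the sketch below. `GF_Mult` and `MixColumn` are repeated from earlier in the article (lightly condensed) so that the example compiles on its own; the `MixColumns` wrapper is my own illustration and assumes the block is stored column-major, so that each column occupies 4 consecutive bytes.

```C
#include <stdint.h>
#include <stddef.h>

// Multiplication in GF(2^8), as shown earlier in the article
uint8_t GF_Mult(uint8_t a, uint8_t b) {
  uint8_t result = 0;
  for (uint8_t i = 0; i < 8; i++) {
    if (b & 1) {
      result ^= a;
    }
    uint8_t shiftGreaterThan255 = a & 0x80;
    a <<= 1;
    if (shiftGreaterThan255) {
      a ^= 0x1b;
    }
    b >>= 1;
  }
  return result;
}

// Mix a single 4-byte column, as shown above
void MixColumn(uint8_t column[]) {
  uint8_t temp[] = { 0, 0, 0, 0 };

  temp[0] = GF_Mult(0x02, column[0]) ^ GF_Mult(0x03, column[1]) ^ column[2] ^ column[3];
  temp[1] = column[0] ^ GF_Mult(0x02, column[1]) ^ GF_Mult(0x03, column[2]) ^ column[3];
  temp[2] = column[0] ^ column[1] ^ GF_Mult(0x02, column[2]) ^ GF_Mult(0x03, column[3]);
  temp[3] = GF_Mult(0x03, column[0]) ^ column[1] ^ column[2] ^ GF_Mult(0x02, column[3]);

  for (size_t i = 0; i < 4; i++) {
    column[i] = temp[i];
  }
}

// Assumes column-major storage: column `col` starts at byte `4 * col`
void MixColumns(uint8_t block[16]) {
  for (size_t col = 0; col < 4; col++) {
    MixColumn(&block[4 * col]);
  }
}
```

A handy sanity check is a column whose four bytes are all equal: the coefficients in each matrix row add (xor) up to `01`, so such a column comes out unchanged.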
333 | 334 | ## Key expansion and the key schedule 335 | 336 | As mentioned a little earlier in the article, AES includes a mechanism for taking the secret key (which for the purposes of this article we will say is 128-bits), and expanding it into a series of *round keys* called the *key schedule*. The process of generating these round keys is fairly straightforward, and makes use of some of the operations we've already seen, as well as some we haven't yet looked at. 337 | 338 |
339 | 340 |
CC BY-SA: Sissssou via https://en.wikipedia.org/wiki/AES_key_schedule
341 | 342 | We start by splitting the secret key into 4 columns. The secret key already constitutes the first round key. The last column is transformed by 3 operations: 343 | 344 | - `RotWord` 345 | - `SubWord` 346 | - `Rcon` 347 | 348 | In the `RotWord` operation, the bytes in the column are *rotated*, much the same as they are in `ShiftRows`, except that the bytes are only ever rotated by one. 349 | 350 |
In the `SubWord` operation, every byte is substituted using the S-box we looked at in the `SubBytes` step.

Finally, in the `Rcon` operation (round constant), the column is added to a predefined, constant column corresponding to the current round. The addition here is of course the `xor` operation. For a 128-bit key, these constant column vectors are:

$$
\begin{bmatrix} 01 \\ 00 \\ 00 \\ 00 \end{bmatrix}
\begin{bmatrix} 02 \\ 00 \\ 00 \\ 00 \end{bmatrix}
\begin{bmatrix} 04 \\ 00 \\ 00 \\ 00 \end{bmatrix}
\begin{bmatrix} 08 \\ 00 \\ 00 \\ 00 \end{bmatrix}
\begin{bmatrix} 10 \\ 00 \\ 00 \\ 00 \end{bmatrix}
\begin{bmatrix} 20 \\ 00 \\ 00 \\ 00 \end{bmatrix}
\begin{bmatrix} 40 \\ 00 \\ 00 \\ 00 \end{bmatrix}
\begin{bmatrix} 80 \\ 00 \\ 00 \\ 00 \end{bmatrix}
\begin{bmatrix} 1b \\ 00 \\ 00 \\ 00 \end{bmatrix}
\begin{bmatrix} 36 \\ 00 \\ 00 \\ 00 \end{bmatrix}
$$

After the last column in the initial round key has been transformed by these operations, the *next* 4-column round key is derived using the following steps:

- (Column 1) Adding the transformed column to the first column of the previous round key
- (Column 2) Adding the new round key *Column 1* to the second column of the previous round key
- (Column 3) Adding the new round key *Column 2* to the third column of the previous round key
- (Column 4) Adding the new round key *Column 3* to the fourth column of the previous round key

This whole process is repeated 10 times, yielding 11 total keys.
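The steps above can be sketched in C roughly as follows. To keep the snippet short and self-contained, it computes S-box entries from their mathematical definition (multiplicative inverse in $GF(2^8)$ followed by the affine transformation) rather than using a lookup table, which is slow but avoids pasting in 256 constants. The function names, the `roundKeys[11][16]` layout, and the column-major byte order are all assumptions made for this illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in multiplication in GF(2^8), reducing by the AES polynomial (0x11b). */
static uint8_t GF_Mult(uint8_t a, uint8_t b) {
  uint8_t product = 0;
  for (int bit = 0; bit < 8; bit++) {
    if (b & 1) product ^= a;
    uint8_t carry = a & 0x80;
    a = (uint8_t)(a << 1);
    if (carry) a ^= 0x1b;
    b >>= 1;
  }
  return product;
}

static uint8_t RotL8(uint8_t x, int shift) {
  return (uint8_t)((x << shift) | (x >> (8 - shift)));
}

/* One S-box entry, computed instead of looked up (brute-force inverse search). */
static uint8_t SubByte(uint8_t x) {
  uint8_t inverse = 0; /* 0 has no inverse and is mapped to 0 */
  for (int candidate = 1; candidate < 256; candidate++) {
    if (GF_Mult(x, (uint8_t)candidate) == 1) {
      inverse = (uint8_t)candidate;
      break;
    }
  }
  return (uint8_t)(inverse ^ RotL8(inverse, 1) ^ RotL8(inverse, 2) ^
                   RotL8(inverse, 3) ^ RotL8(inverse, 4) ^ 0x63);
}

/* Expand a 128-bit key into 11 round keys; roundKeys[0] is the key itself. */
void KeyExpansion(const uint8_t key[16], uint8_t roundKeys[11][16]) {
  static const uint8_t rcon[10] = {
    0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1b, 0x36
  };

  memcpy(roundKeys[0], key, 16);

  for (size_t round = 1; round <= 10; round++) {
    const uint8_t *prev = roundKeys[round - 1];
    uint8_t *next = roundKeys[round];

    /* RotWord: rotate the last column of the previous round key by one byte */
    uint8_t transformed[4] = { prev[13], prev[14], prev[15], prev[12] };

    /* SubWord: substitute every byte through the S-box */
    for (size_t i = 0; i < 4; i++) {
      transformed[i] = SubByte(transformed[i]);
    }

    /* Rcon: only the first byte of the round constant column is non-zero */
    transformed[0] ^= rcon[round - 1];

    /* Column 1: transformed column + first column of the previous round key */
    for (size_t i = 0; i < 4; i++) {
      next[i] = prev[i] ^ transformed[i];
    }

    /* Columns 2-4: previous new column + matching column of the previous key */
    for (size_t i = 4; i < 16; i++) {
      next[i] = prev[i] ^ next[i - 4];
    }
  }
}
```

The AES specification's appendix includes a worked key expansion for the key `2b7e151628aed2a6abf7158809cf4f3c`, which makes a convenient test vector for a routine like this.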
428 | 429 | ## Add Round Key 430 | 431 | Understanding the key schedule, as well as the math behind $GF(2^8)$ makes understanding the `AddRoundKey` operation quite straightforward. The input block is added to the corresponding round key, generated during *key expansion*. 432 | 433 | ## Algorithm recap 434 | 435 | With all of those operations in mind, let's revisit the bird's-eye algorithm view again. First of all, *key expansion* takes place, using the 128-bit secret key provided by the user. Then, for any given 128-bit block of plaintext data, the following transformation is applied: 436 | 437 | 1. Addition of the first round key 438 | 2. 9 Rounds: 439 | - Substitute Bytes 440 | - Shift Rows 441 | - Mix Columns 442 | - Adding the Round Key 443 | 3. The final round 444 | - Substitute Bytes 445 | - Shift Rows 446 | - Adding the Round Key 447 | 448 | The output of these operations is an encrypted block of ciphertext. Decryption of the data is the precise inverse. The operations are performed in reverse, including shifting in the other direction during the `ShiftRows` step, using a different S-box in the `SubBytes` step, and using a different coefficient matrix during the `MixColumns` step. 449 | 450 | One question you might rightly ask yourself at this point is: How do you encrypt data that isn't an exact multiple of 128-bits? There are a few different approaches you can take, but one of the most common is a clever trick that adds, at most, 128 bits of extra information to the ciphertext. 451 | 452 | ## Encrypting data that isn't a multiple of 128 bits 453 | 454 | The trick is to work out how many additional bytes would be required to have a full 16 byte block. Then the block is padded with that many bytes, where each of those bytes has the value equal to the number of padded bytes. 455 | 456 |
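A minimal sketch of this padding rule, together with the matching check that is performed on decryption (described just below), might look like this in C. The function names and the convention that the caller provides an output buffer with room for one extra block are assumptions for the sake of the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16

/*
 * Copy `length` bytes from `data` into `out` and append the padding.
 * `out` must have room for length + BLOCK_SIZE bytes.
 * Returns the padded length, which is always a multiple of BLOCK_SIZE.
 */
size_t PadData(const uint8_t *data, size_t length, uint8_t *out) {
  /* Between 1 and 16 bytes: already-aligned data gets a whole extra block */
  size_t padCount = BLOCK_SIZE - (length % BLOCK_SIZE);

  memcpy(out, data, length);
  memset(out + length, (int)padCount, padCount);

  return length + padCount;
}

/*
 * Validate the padding at the end of decrypted data.
 * Returns 0 and stores the original length on success, -1 if the padding is malformed.
 */
int UnpadData(const uint8_t *data, size_t length, size_t *unpaddedLength) {
  if (length == 0 || length % BLOCK_SIZE != 0) {
    return -1;
  }

  uint8_t padCount = data[length - 1];
  if (padCount == 0 || padCount > BLOCK_SIZE) {
    return -1;
  }

  /* Every padding byte must hold the same value: the number of padding bytes */
  for (size_t i = length - padCount; i < length; i++) {
    if (data[i] != padCount) {
      return -1;
    }
  }

  *unpaddedLength = length - padCount;
  return 0;
}
```

For a 5-byte input, `PadData` appends eleven bytes of `0x0b`; for a 16-byte input it appends a full block of `0x10`.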
The block is then encrypted as normal. If the data is already a multiple of 16 bytes, an additional block is added, where every byte is the value `0x10` (16).

During decryption, the very last block of data is checked for this pattern. The last byte in the block should contain the number of padding bytes. If that number of bytes with the same value are not found, then in most cases the decryption will be considered to be invalid or corrupted. If the bytes do match, the padding is stripped from the final output.

Note that this method is not part of the actual AES specification; rather, it is specified in [RFC 5652](https://datatracker.ietf.org/doc/html/rfc5652#section-6.3), the Cryptographic Message Syntax, which is derived from the Public Key Cryptography Standards (PKCS #7). The AES specification really only describes how the key schedule works, and how to encrypt a single block of data.

## Modes of operation

How entire files can be encrypted and decrypted in a systematic way is determined by what is known as a *mode of operation* - and AES has several of them. We're going to take a look at two of them - ECB and CBC - and understand how even if the encryption algorithm itself is sound, how it is applied to data can create a new set of weaknesses.

### ECB: Electronic Code Book

The first mode of operation is ECB, or the *Electronic Code Book*. This mode is the most obvious method of using the algorithm: first compute all of the round keys, and then iterate through the plaintext data in chunks of 128 bits, turning that data into a block, and running it through the algorithm. The ciphertext output is placed into an output buffer or file, and the next 128 bits are loaded and processed.

There is a problem with this mode, however. While the encryption itself is completely secure (when implemented correctly), for some kinds of data, ECB can actually leak structural information. The most famous example of this is "the penguin".
This is Tux, the well-known mascot of the Linux community. If we encrypt this image data in a bitmap (i.e. non-compressed) form using AES-ECB (making sure to fix any headers required for the image to correctly render), then something like this very alarming image pops out.
Now, while the data in the second image is very clearly scrambled, we can all very much *still see the penguin*.

You may already have some idea of what's going on here. In AES-ECB, every block is encrypted individually, using the same key schedule. That has the implication that, if we feed the same 16 bytes of plaintext into the encryption algorithm multiple times, we will always get the same encrypted output. In the plaintext image of Tux, each pixel is encoded with 3 bytes - one for red, one for green, and one for blue. This means that any given block fed to the algorithm only contains ~5 pixels. On an image such as this, where the colour palette is limited, and large portions contain only one colour, the result is that those large blocks of repeated colour are simply replaced with a corresponding jumbled pattern. Importantly, though, that pattern is always the same, so the overall structure of the image is preserved.

Even if the data itself is secure, leaking structural information like this can still be devastating in sensitive contexts. To remedy the problem, we can use another mode of operation, called CBC, or *cipher block chaining*.

### CBC: Cipher Block Chaining

In AES-CBC, the problem is addressed by forcing interdependence between all of the blocks in the plaintext data. When encrypting, a random IV, or *initial value/initialization vector* is chosen. The IV is simply a random 16-byte block. This IV is added (again, with `xor`) to the first plaintext block before it is encrypted. The output is the first ciphertext block, $CT_0$. The next plaintext block is then added to $CT_0$ before being encrypted, forcing a reversible chain of dependence between the IV, the first plaintext block, and the second. This chain continues for every block in the plaintext input.
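To make the difference between the two modes concrete, here is a small sketch of both loops in C. The single-block encryption is passed in as a function pointer, and `ToyEncryptBlock` is a deliberately trivial placeholder (it just xors every byte with one key byte) standing in for a real AES block encryption; it is not a cipher, and all of the names and signatures here are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 16

/* Assumed signature for "encrypt one 16-byte block in place". */
typedef void (*BlockEncryptFn)(uint8_t block[BLOCK_SIZE], const void *context);

/* Placeholder, NOT a cipher: xors every byte with a single key byte. */
void ToyEncryptBlock(uint8_t block[BLOCK_SIZE], const void *context) {
  uint8_t keyByte = *(const uint8_t *)context;
  for (size_t i = 0; i < BLOCK_SIZE; i++) {
    block[i] ^= keyByte;
  }
}

/* ECB: every block is encrypted independently of all the others. */
void EcbEncrypt(uint8_t *data, size_t blockCount,
                BlockEncryptFn encryptBlock, const void *context) {
  for (size_t b = 0; b < blockCount; b++) {
    encryptBlock(data + b * BLOCK_SIZE, context);
  }
}

/* CBC: each block is xored with the previous ciphertext block (or the IV) first. */
void CbcEncrypt(uint8_t *data, size_t blockCount, const uint8_t iv[BLOCK_SIZE],
                BlockEncryptFn encryptBlock, const void *context) {
  const uint8_t *previous = iv;

  for (size_t b = 0; b < blockCount; b++) {
    uint8_t *block = data + b * BLOCK_SIZE;

    for (size_t i = 0; i < BLOCK_SIZE; i++) {
      block[i] ^= previous[i];
    }

    encryptBlock(block, context);
    previous = block; /* this ciphertext block feeds into the next one */
  }
}
```

Feeding two identical plaintext blocks through `EcbEncrypt` gives two identical output blocks, which is exactly the penguin problem, whereas `CbcEncrypt` with the same input produces two different blocks because the second one depends on the first.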
491 | 492 | The IV is transmitted *in the clear* (i.e. not encrypted), along with the ciphertext. If we encrypt the Tux image data in CBC mode instead, the result is indistinguishable from random noise: 493 | 494 |
This approach is clearly more secure, but comes at a cost. Whereas ECB is completely parallelizable (it's [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel) in fact), CBC encryption must be run sequentially. Most people will agree, however, that when you really want to keep something secret, a little extra time is probably worth the trade off.

### Other modes of operation

There are many more modes of operation in AES that are out of scope for this article. AES-GCM, or *Galois/Counter Mode*, is a mode that facilitates *authenticated* encryption. With authenticated encryption, the aim is to ensure that the receiver of an encrypted message knows with certainty that the message indeed came from the expected sender and has not been tampered with. Other modes, such as AES-CFB, allow AES to be used in a streaming fashion.

Modes of operation turn out to be just as important as the algorithm itself, and choosing the wrong mode for the data, context, and situation can have large implications for both security and performance.

## Closing words

I used to think of encryption as, essentially, magic - and dark magic at that! But there is nothing like diving into the details and building something from the ground up to show that all things can, with a little work and patience, be understood. I hope this article arms you with the tools and the confidence to try some of this out for yourself. I don't expect that the information presented here will be enough for you to write a complete implementation immediately, but it should form a good starting point. The specification provides not only a very approachable view on the subject, but also appendices with test vectors and expected outputs for all steps of the various operations. Aumasson's book is also a fantastic resource, not only for information about AES, but for an excellent overview of many different types of modern encryption and the field as a whole.
## Appendix 1: Generating the tux images

The Tux image used to show the weakness of ECB was generated by converting the original [Tux SVG image](https://commons.wikimedia.org/wiki/File:Tux.svg) into the PPM raw image format. PPM is a binary format that directly encodes the pixel values for every pixel in the image. There is a ~60 byte header that describes the version of the format used, the dimensions, and the maximum value that a colour value can take on.

In order to generate the encrypted image in binary form, the following script was run:

```bash
#!/usr/bin/env bash

# Create a random 16 byte (128-bit) key (not at all a secure way of generating keys!)
head -c 16 /dev/urandom > key.bin

# Extract the header (the first 61 bytes) from the original image
head -c 61 Tux.ppm > header.txt

# Extract the image data (everything from byte 62 onwards) from the original image
tail -c+62 Tux.ppm > body.bin

# Encrypt the image data using AES-ECB
aes-c -e -m ecb -k key.bin -i body.bin -o body.enc.bin

# Concatenate the header + encrypted image data
cat header.txt body.enc.bin > tux.enc.ppm
```

After the resulting PPM image was generated, it was converted to png in order to be shown in this article. With just a little tweaking, you can produce a beautiful array of leaky tuxes in gif form:
536 | 537 | ## Appendix 2: Further reading 538 | 539 | - https://nvlpubs.nist.gov/nistpubs/fips/nist.fips.197.pdf 540 | - https://en.wikipedia.org/wiki/Advanced_Encryption_Standard 541 | - https://en.wikipedia.org/wiki/AES_key_schedule 542 | - https://en.wikipedia.org/wiki/Substitution%E2%80%93permutation_network 543 | - https://en.wikipedia.org/wiki/Finite_field_arithmetic 544 | - https://en.wikipedia.org/wiki/Confusion_and_diffusion 545 | - https://en.wikipedia.org/wiki/Field_(mathematics) 546 | - https://en.wikipedia.org/wiki/Finite_field 547 | - https://en.wikipedia.org/wiki/GF(2) 548 | - https://en.wikipedia.org/wiki/Characteristic_(algebra) 549 | - https://words.filippo.io/the-ecb-penguin/ 550 | - https://www.youtube.com/watch?v=O4xNJsjtN6E 551 | -------------------------------------------------------------------------------- /2023/3/1/building-a-jank-uart-cable-from-scavenged-parts.md: -------------------------------------------------------------------------------- 1 | # Building A Jank UART to USB Cable From Scavenged Parts 2 | 3 | My home "lab" is, unfortunately, a manifestation of the unwinnable, uphill battle against entropy. The latest victim to the sprawl of boards, prototypes, and other miscellanea was my little Adafruit CP2104 USB to serial converter. As far as I can tell, it has literally dropped off the face of the earth. This is particularly irritating as I'm in the middle of another project for an upcoming video on the [channel](https://youtube.com/@lowbyteproductions), and needed that exact serial converter to test the implementation of a vital component in a system I'm building. 4 | 5 | Of course, the first thing I did was hop online to order a new converter. Shortly thereafter, however, I came to the sudden realisation that this is something *I should be able to make myself*. I mean, whats the point in having a lab in the first place if you're not going to use it for this kind of thing! This is a yak that simply needs to be shaved. 
## What even is a USB to serial converter?

Maybe I should contextualise this for a minute. What even *is* a serial converter? The high level overview is that "serial", or UART (*Universal Asynchronous Receiver-Transmitter*) is a protocol for exchanging data between two systems over two independent wires, with each side able to transmit and receive simultaneously, without a shared clock signal to keep the two sides synchronised (hence "Asynchronous" in the name). This simple system has existed since at least as far back as the 1960s, and continues to be ubiquitous in the embedded world. A large part of that longevity is that it's such a simple protocol, from both a software and a hardware point of view.

The thing is, our modern computers can't speak UART directly. They used to, though: before USB became the only port available on your laptop, RS-232 serial ports were extremely common (even my earliest computers had these ports, and I'm not *that* old). But even though USB is everywhere now, it's actually a pretty complex beast. Most microcontrollers don't offer any kind of USB peripheral support out of the box (though awesome projects like [TinyUSB](https://github.com/hathach/tinyusb) can help with that), but the problem of wanting to connect embedded devices to computers easily and cheaply still very much exists. This is why companies like FTDI exist.

FTDI make a whole range of different chips that are able to take simple data streams like UART, and wrap them up into the complex world of USB. Even if you're not deep into embedded or electronics, you might recognise the name FTDI from their [pretty poor-taste scandal from a few years back](https://www.techrepublic.com/article/ftdi-abuses-windows-update-pushing-driver-that-breaks-counterfeit-chips/). Still, they're not the only ones in the market: Silicon Labs make the CP2104 I mentioned earlier, and the CH340 from Chinese manufacturer WCH is also pretty common.
These kinds of chips are in a lot more devices than you might realise. After all, they are one of the simplest ways to make your device user friendly, and often run on standard USB drivers.

## Salvaging parts

I happened to have this old Arduino Duemilanove board just lying around, gathering dust, which has an FT232RL chip from FTDI to convert serial UART to USB. The FT232RL is one of the simplest and most common USB to UART chips out there, and is what the Arduino uses, together with the famed Arduino bootloader, to make it easy for people to upload code to the board with the push of a button.
22 | 23 | The first thing I did was fire up the hot-air rework station in order to desolder the chip, along with the two capacitors just above it. This is essentially a machine that can blow hot air out of a nozzle at a precisely controlled temperature (several hundred degrees), and a precisely controlled flow rate. This hot air is able to heat up all of the pins of the chip at once, melting the solder. To help the process along, I added some flux to all of the pins to get the solder to melt and flow better. Then it was a simple case of swiftly grabbing it with some tweezers and setting it aside. Flux is kind of nasty, and leaves the part sticky. I cleaned it up with a cotton bud and some alcohol. 24 | 25 |
As a side note, those teeny-tiny capacitors act as a buffer for the power supply that comes into the chip. If there are any ripples or disturbances on the line, they smooth those irregularities out, and provide a clean power signal. Since this is a story about building a jank cable, I didn't end up using them. If I encounter any problems in the future, I might stick them back in.

You might have noticed this is an [SMD](https://nl.wikipedia.org/wiki/Surface-mounted_device) chip, which makes strapping a few pin headers to it a little challenging. That's where these awesome little SMD breakouts come in. A set of these is relatively inexpensive, and they basically allow you to turn any random chip into a dev board, which is pretty cool.
The FT232RL is a 28-pin SSOP chip, or ["shrink small-outline package"](https://en.wikipedia.org/wiki/Small_outline_integrated_circuit). This is the kind of thing you just learn to spot after a while, but you can always find this information in the chip's datasheet. Or in a pinch, just line it up with the various sizes on the breakouts until you find one that fits!

To solder the chip to the board, I could have used solder paste and the hot-air rework station, but I don't actually have any solder paste on hand right now. So instead, I put a fine tip on the soldering iron, and channeled my inner [Voultar](https://www.youtube.com/c/Voultar). This kind of relatively low stakes operation with a fine-pitch part is really great for improving technique. You start by really taking some time to line the chip up with the pads on the board, then with a really tiny amount of solder on the iron, carefully tack one corner pin to its corresponding pad. At this point, if anything is misaligned, you can easily hit it with the iron again and reposition, until it's tacked down and perfectly in place.

When I had one corner down, I did the others in the same way. For the rest of the pins, I used some flux, and a drag soldering technique to get good contact between the pins and the pads. This essentially involves putting a little solder on the iron, and carefully "dragging" it across the pins, back and forth, until the solder is evenly distributed. The flux helps to direct the solder to the pins, but with such a fine pitch, solder bridges can easily form. Thankfully, they are just as easy to remove by cleaning the iron off, adding more flux, and dragging back and forth again. Rinse and repeat.
With the chip nicely soldered, the next step was getting a USB cable attached. In non-jank UART to USB converters, you'll generally find a little micro USB port on board so that you can bring your own cable to the party. I did spend a few minutes thinking about ways that I might be able to get a micro USB port rigged up, but I quickly dismissed them in favour of simply cutting down an old USB cable and attaching the wires directly to the breakout.

By taking a look at the [datasheet for the FT232RL](https://nl.mouser.com/datasheet/2/163/DS_FT232R-11534.pdf), I was able to locate the USB Data plus and minus signals on pins 15 and 16 respectively. The other signals of interest for this application are:

- TXD on pin 1, the transmit pin
- RXD on pin 5, the receive pin
- VCCIO on pin 4, which is a reference voltage. The voltage on this pin will be used on TXD and RXD for data transfer. This is useful because it means the device can talk to 5V and 3.3V logic, both of which are common in electronics.
- VCC on pin 20. This is the main power supply for the chip, and can be anything from 3.3V to 5.25V. Since USB already includes a 5V power rail, it made sense to power the chip from it.
- GND, which appears on multiple pins. This provides the reference against which all other voltage levels are measured.
You'll notice there are a bunch of additional signals on this chip. Some of them are used for more complex serial communication, and some are configurable programmable pins that can be used for extra functionality. I don't really need any of that though.

For those that don't know, in its simplest form, USB data is transferred over a differential pair of twisted wires that transmit the same signal, but with opposite polarity. This is necessary for keeping the signal integrity high, and is done with pretty much every high speed signalling solution. Note that this is a single signal transmitted across two wires, not a transmit and receive pair, like in UART.

I had a USB charging cable that someone had given me some time ago, which I think was from some kind of smart watch. It appeared to have all 4 lines connected, and I wasn't going to use it for its intended purpose, so it seemed like a good candidate for salvage.
60 | 61 | After cutting the cable, I stripped back the insulation and shielding, and carefully exposed the internals of each of the wires, and then tinned them with a little solder. 62 | 63 |
The colours of the wires are standard. Red and black are power and ground, and green and white are data plus and data minus. Ratcheting up the soldering difficulty just a little more, I soldered the two data wires *directly* to the pins of the chip. Now, you're probably thinking to yourself: "Why would you not just solder to the nice, big, widely spaced holes for the pin headers?", which is a fair question. I suppose this is the one place I opted for a little less jank. You see, differential pairs actually require the length of wires and signal traces on a PCB to be carefully controlled. In theory, if the plus wire was a little longer than the minus wire, the two signals would arrive at the chip at *very slightly* different times. That would mess up the signal integrity, and maybe even prevent the device from working. All of that said, I'm almost certain that this wasn't required at all. There's a [marcan](https://social.treehouse.systems/@marcan) quote out there somewhere, where he talks about being able to run slower speed USB signals over a pair of wet noodles, and I couldn't help but think of that here. Still, good soldering practice (those pads are 0.65mm apart)!
68 | 69 | 70 | On the reverse side of the breakout board, there is a footprint for a larger 28 SOP-type chip, which is of course wired to the same pins as the ones the FT232RL is attached to. I fed the power and ground cables to that side of the board and soldered them to the larger pads. 71 | 72 | All of the other signals of interest are on the left side of the chip, so I soldered in some regular 2.54mm pitch pin headers to be able to reach the TX and RX signals, as well as the VCCIO, and a ground connection to share with whatever device this one will be talking to. 73 | 74 |
75 | 76 | At this point, I had checked that nothing was shorted using continuity mode on my multimeter, and tried plugging it in to my laptop. After running `lsusb`, which lists the connected USB devices, I saw this: 77 | 78 |
No magic smoke, and the 6th entry showed that it was working! Of course, I wanted to check that the actual UART communication was operational too, so I wrote the shortest Arduino sketch I could think of:

```C++
uint8_t value = 0;

void setup() {
  Serial.begin(115200);
}

void loop() {
  char charToSend = 'A' + value;
  Serial.print(charToSend);

  if (++value >= 26) {
    value = 0;
  }

  delay(1000);
}
```
After connecting the cable up to an Uno board, I ran `screen /dev/ttyUSB0 115200 8N1` to connect to the UART serial port, and was greeted by:

```
FGHIJKLMN
```

Notice the kapton tape and hot glue mess. That's really there to reinforce the flimsy solder connections, and to stop anything from shorting later on, but it definitely helps with the scavenged look.

So the jank cable was a success! I'm still happy that a professionally made adapter is coming in the next few days, but damn if it doesn't feel cool to know that I can whip up my own tools in a pinch.
115 | 116 | 117 | ## And if you're looking for some UART coolness 118 | 119 | [James Sharman](https://www.youtube.com/playlist?list=PLFhc0MFC8MiCs8W5H0qZlC6QNbpAAe_O-) has an excellent video series where he both explains and builds a UART implementation in hardware using discrete logic chips. James's channel is honestly criminally undersubscribed. All of his stuff is great. 120 | -------------------------------------------------------------------------------- /2024/11/1/sending-an-ethernet-packet.md: -------------------------------------------------------------------------------- 1 | # I sent an ethernet packet 2 | 3 | For as long as I've been making videos on the [low byte productions](https://youtube.com/@lowbyteproductions) youtube channel, I've wanted to make a series about "Networking from scratch", by which I mean building a full TCP/IP stack from the ground up on a microcontroller. It's been nearly 6 years now, and the past few days felt like as good a time as any to start. 4 | 5 | This blog entry is fairly limited in scope; On the surface, it's about how I successfully sent my first ethernet packet, but really it's a story about bugs and debugging, and some thoughts about overcoming challenges in projects. 6 | 7 | ## Microcontroller 8 | 9 | The microcontroller I'm using is an STM32F401 on a nucleo devboard. This is the same setup I used in the [Blinky to bootloader](https://www.youtube.com/watch?v=uQQsDWLRDuI&list=PLP29wDx6QmW7HaCrRydOnxcy8QmW0SNdQ) series, the [3D renderer on an oscilloscope](https://www.youtube.com/watch?v=TAfWea21ooM) video, and a bunch of others. It's an ARM Cortex-M4 that can run up to 84MHz, with 96KiB of RAM. That's enough to hold onto decent handful of packets. 10 | 11 | ## Ethernet 12 | 13 | Ethernet is a word that covers a surprising number of things. You might associate it with the port you plug into for wired internet, or know about the idea of ethernet *frames*. 
Ethernet is actually a whole family of technologies and standards (many of which are now obsolete) that encompasses the hardware involved at the physical level, various signalling formats that encode and transmit bits, the strategies for dealing with bus collisions, and the layout of frames, which contain and protect the data being sent from one place to another.

Due to the complexity of the signalling involved with ethernet, a dedicated ASIC is generally used, which takes in data at the frame level, and takes care of wiggling the electrical lines in the right way over a cable ([though there are of course exceptions](https://www.youtube.com/watch?v=mwcvElQS-hM)). For this project, I'm using the W5100 chip from Wiznet, in the form of the original Arduino Ethernet shield. Well, it's a cheap knockoff that ended up needing some rework to actually function properly, but we'll get to that.

The W5100 chip itself is pretty cool. It's essentially an ethernet ASIC with a hardware TCP/IP stack built-in. It has 4 "sockets", which can be set up to work at the TCP, UDP, IP, or "MAC Raw" levels. Since the whole point of this project is to build a TCP/IP stack from scratch, I'm only making use of one socket in the MAC Raw mode, where I hand it ethernet frames, and it sends them out. The caveat is that an *actual* ethernet frame contains more than just the data: it also includes a preamble, a start of frame marker, and a 32-bit CRC. The preamble and start of frame marker are really only relevant at the electrical level, which the chip fully takes care of, and the CRC is also helpfully computed by the W5100.

## Problem No. 1: Shouting into the void

I coded up a driver to communicate with the W5100 chip. Data exchange takes place using SPI - the serial peripheral interface - which is a 4-wire signalling protocol for full duplex communication.
One of the wires is for the main chip's (the microcontroller's) output, one is for the subordinate chip's (the W5100's) output, one is for the clock to which data is referenced, and the last is a "chip select" signal, which tells the subordinate chip that communication is happening. There are more details to SPI, which dictate things like if your data is referenced to the rising or falling edge of the clock, whether the clock signal idles high or low, whether bits are transferred MSB or LSB first, and how many bits occur per transfer (typically 8, or one byte).

That gives you a low-level way to exchange bytes, but basically nothing else. The W5100 datasheet lays out a higher-level protocol on top of SPI to actually control the chip. In this protocol, you send commands made up of 4 bytes on the MOSI (main out, subordinate in) line:

```
+--------+-----------+----------+--------+
| byte 0 | byte 1    | byte 2   | byte 3 |
+--------+-----------+----------+--------+
| op     | addr high | addr low | value  |
+--------+-----------+----------+--------+
```

The first byte specifies the operation, which is either a write (`0xf0`) or a read (`0x0f`) - anything else is invalid and should be ignored. The second and third bytes specify a 16-bit address (big-endian), and the last byte is, in the case of a write, a value to be written at that address. The W5100 transfers data out on MISO (main in, subordinate out) line at the same time, which follows this pattern:

```
+--------+--------+--------+--------+
| byte 0 | byte 1 | byte 2 | byte 3 |
+--------+--------+--------+--------+
| 0x00   | 0x01   | 0x02   | 0x03   |
+--------+--------+--------+--------+
```

In the case of a read operation, byte 3 actually returns the value read at the specified address.

What are these addresses?
Well, internally, the W5100 has its own *address space*, which has transmit and receive buffers, as well as a bunch of registers for setting up and controlling the flow of packets. For reference, it's laid out like:

```
+-----------------+------------------+
| Address Range   | Function         |
+-----------------+------------------+
| 0x0000 - 0x002f | Common registers |
| 0x0030 - 0x03ff | Reserved         |
| 0x0400 - 0x07ff | Socket registers |
| 0x0800 - 0x3fff | Reserved         |
| 0x4000 - 0x5fff | TX Memory        |
| 0x6000 - 0x7fff | RX Memory        |
| 0x8000 - 0xffff | Reserved         |
+-----------------+------------------+
```

My first clue that something wasn't right was that I was sending out commands on MOSI, and seeing garbage on MISO. One thing I really like about this chip is that it specifically sends back a known value every time you clock out a byte, and so it's really easy to see that something is off in the communication. I puzzled over this for a while, and realised I'd fallen into the trap that you should always avoid when doing anything embedded - I unconditionally trusted the hardware.

As I mentioned earlier, the W5100 is sitting on an arduino shield - which is an addon board meant to clip directly on top of the standard but absolutely awful arduino headers. Because Arduino is so ubiquitous, this shield format has found its way onto myriad other devboards that have nothing to do with arduino, simply because it gives you an instant tie-in to readily available peripheral modules. The nucleo devboards from ST also have headers to accommodate arduino shields, but alas, this particular shield had an annoying little trick up its sleeve. You see, the arduino also has another little 6-pin header on board called the ICSP - the in-circuit serial programmer. This header can be used to reprogram Atmel chips (which the traditional arduinos use) over SPI.
The designers of this shield decided to make use of that 6-pin header, and the fact that it shares the same SPI signals with those on the standard header, and routed the SPI lines to the ICSP header **instead of** the standard pins on the arduino header. On a proper arduino, this would be no problem, because the arduino board itself internally connects those signals. But a nucleo board doesn't have an ICSP header, so the SPI signals I was sending out were going nowhere. 66 | 67 | 68 | 69 | I realised this after probing around with a multimeter, measuring resistances between the power rails and the SPI signals. If you measure *infinite* resistance, you know there is a problem! A quick trip to the soldering station, and a few enamelled copper wires later, I had bodged my board to properly connect everything to the nucleo. 70 | 71 | ## Problem No. 2: The essence of comedy 72 | 73 | With the SPI signals actually going where they should be, I was actually communicating with the device! I had enough code written to set up the W5100 to do raw ethernet transmissions, set up the MAC address, configure the TX and RX memory segments, and finally write a simple test ethernet frame into TX memory and trigger a transmission. I connected the nucleo+shield to my laptop with a CAT5 cable, fired up wireshark, and! Nothing. 74 | 75 | If you're regularly doing stuff down at the low level, this kind of problem can be really daunting. There are no helpful (or even cryptic) error messages to investigate. No stack trace to follow. It's simply: You wiggled some electrons up and down across the wires, and it didn't do the thing you thought it would. 76 | 77 | One very useful tool to have in your belt for this kind of thing is a logic analyser, which is like an oscilloscope, but specifically made for analysing digital signals, i.e. ones that go between a logical high and low level. These devices have a number of channels - usually between 8 and 32 - and a USB connection. 
In the middle is some speedy magical silicon which buffers, conditions, and samples those channels at a high rate, compresses the bitstreams, and pushes them through the USB, ready to be analysed by software. The cool thing here is that you can investigate individual signals, or more usefully, groups of signals (like an entire SPI bus). In that case, the software will actually interpret the signals, and give you a display of the bytes sent in a transaction, as well as export tools for processing those same bytes as a CSV or other structured format. 78 | 79 | I have a Saleae Logic 8, which is a fairly fancy device, but you can get cheap and cheerful analysers for ~€10, which can also integrate with Saleae's Logic2 software (they're pretty iffy about working correctly and consistently, but there's always a tradeoff, right). As soon as I captured the SPI traffic flying across, I could see something was off. The first command I sent looked good, and so did the response over the MISO line. It's a software reset, and appears something like this: 80 | 81 | ``` 82 | +--------+--------+--------+--------+ 83 | | byte 0 | byte 1 | byte 2 | byte 3 | 84 | +------+--------+--------+--------+--------+ 85 | | MOSI | 0xf0 | 0x00 | 0x00 | 0x80 | 86 | +------+--------+--------+--------+--------+ 87 | | MISO | 0x00 | 0x01 | 0x02 | 0x03 | 88 | +------+--------+--------+--------+--------+ 89 | ``` 90 | 91 | Decoded, it's a `write` to address `0x0000`, the "Mode Register", with the most significant bit set, which triggers a software reset. You can also see that the device dutifully responded with the bytes `0x00` to `0x03`. The next command is a `read` from the same address, which is supposed to check if that same most significant bit has been cleared to 0, which indicates that the reset is complete. 
It looked like this: 92 | 93 | ``` 94 | +--------+--------+--------+--------+ 95 | | byte 0 | byte 1 | byte 2 | byte 3 | 96 | +------+--------+--------+--------+--------+ 97 | | MOSI | 0x0f | 0x00 | 0x00 | 0x00 | 98 | +------+--------+--------+--------+--------+ 99 | | MISO | 0x03 | 0xff | 0xff | 0xff | 100 | +------+--------+--------+--------+--------+ 101 | ``` 102 | 103 | Huh. The response on the MISO line is not the counting up behaviour, but instead it starts at `0x03` and then just responds forevermore with `0xff`. Not good. I pondered this one for a little while, and thankfully the little knowledge I have of digital logic, and actually building hardware on FPGAs came in clutch. The fact that the first value returned by the W5100 was `0x03`, which is also the *last* value it returned in the previous transaction, is a pretty big clue. 104 | 105 | Digital hardware, internally, is made up of circuits that implement useful primitives like "registers" and "flip-flops". These circuits are *synchronous* to a clock, and are able to capture/store/reset values presented to their inputs. You take these primitives, mix them together with others like multiplexers (the circuit equivalent of an if-statement), decoders, and of course logic gates, and you can build up to more complex constructions like state machines. The really important thing about all of this is that there are some pretty important timing-related constraints that need to be upheld. It takes a signal time to come out of a register, pass through a handful of logic gates/multiplexers/whatever, and then enter a new register. If, for example, the clock were to switch too fast, before the signal had made its way to the next register, the abstraction falls apart and it simply ain't gonna work. 
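
To put that constraint in symbols, here is the usual textbook form of it (not something taken from the W5100 datasheet). For a signal travelling from one register to the next, the clock period has to cover the source register's clock-to-output delay, the worst-case delay through the logic in between, and the setup time of the destination register:

$$
T_{clk} \ge t_{cq} + t_{logic} + t_{su}
$$

With purely illustrative numbers, say $t_{cq} = 2\,\text{ns}$, $t_{logic} = 6\,\text{ns}$ and $t_{su} = 1\,\text{ns}$, the period has to be at least $9\,\text{ns}$, so that path tops out at roughly $1 / 9\,\text{ns} \approx 111\,\text{MHz}$. Clock it any faster and the destination register captures whatever stale or half-settled value happens to be on its input.
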
106 | 107 | 108 | 109 | My thinking was that something like this was happening here; The W5100 didn't have enough time to do its thing, and that's why I was seeing the *previous* state of MISO (`0x03`) when the second command transaction was taking place. It probably wasn't going to be the SPI clock though, since the datasheet implies the maximum clock rate for the SPI bus is ~14MHz, and I was running way below that. Instead, I suspected that the chip select line was probably going high (i.e. being disabled) too soon, causing the chip to get into a bad state. 110 | 111 | I added a small loop to introduce a few microseconds of extra time before changing the state of the chip select line, and it worked! All of a sudden, the responses on the MISO line looked right. And furthermore, it seemed like the rest of the setup was also working, and I was getting sensible and sane values back from the chip when reading back configuration values I had written to it. 112 | 113 | I still haven't completely gotten to the bottom of the matter. Diligent readers will note that the datasheet does in fact contain a SPI timing diagram on page 66, but the only constraint mentioned there with regard to the chip select line was certainly being met. I should go back and try to figure out the exact bounds where things break down, but there are only a certain number of hours in the day, and right now I'd rather just progress with the project. 114 | 115 | ## Problem No. 3: There's life in this packet, but not as we know it 116 | 117 | Of course, the first thing I did was to fire up wireshark again, and see if that packet would appear. And it did! Sort of. Well, not really. 118 | 119 | *A* packet did appear, but not the one that I'd sent. Instead, at the moment the transmission command went out of the microcontroller, a raw ethernet packet, filled with garbage, and far larger than the one I had written into TX memory, came up on the screen. 
120 | 121 | 122 | 123 | I'd been trying to avoid reading other people's code related to this chip, because a lot of the fun of this kind of project, for me, comes in the exploration and discovery of trying to get the thing working from specs and datasheets alone. But at this stage, I was willing to sanity check my understanding against something known to be working properly. 124 | 125 | A little tip that a lot of very seasoned embedded developers could learn from is that, at times like this, arduino actually comes in *really* useful. Grab an arduino, install some library, write ~5 lines of code (because all the complex stuff happens under the hood), and if you get the result, you now have a gold standard working thing to compare to. It actually took me quite a lot longer to find a library that would let me transmit a raw ethernet packet, however. As it turns out, most Arduino users, who want to be able to add network connectivity to a project, do not want to do it all from scratch. The official libraries actually removed support for sending raw packets from the public API. However, I managed to find a [project on github](https://github.com/njh/W5100MacRaw) that implemented the minimum necessary steps to send and receive packets. 126 | 127 | I read the source, and compared against my own, and there were no immediately obvious offenders. Some of the register writes and reads occurred in a different order, some were omitted in mine or theirs, but none of that seemed to be the problem. Normally sequence-dependent stuff like that is called out explicitly in the datasheet. Still, I changed my reads and writes to mimic theirs exactly, and for whatever reason I was still getting the garbage packet in wireshark. 128 | 129 | I decided at this point to try a different tactic. I believe in writing little tools to help guide development and debugging. This will be things like taking some low-level output, parsing it, and re-presenting it in a higher-level form. 
I typically write these little tools in python, and take the time to do argument parsing to properly document what comes in and what goes out. In this instance, I wanted to take the SPI capture CSV export coming from Saleae's Logic2 software, parse the bytes, and present a list of register writes and reads, with proper names, formatting, etc. The full tool is just over 200 lines of code, most of which is just a copy-paste of register names/addresses directly from the data sheet. The whole thing took me maybe one hour to put together. 130 | 131 | I took a capture of the gold-standard Arduino, as well as my garbage-producing implementation, and then passed the CSV data export of both through my tool. Running a simple diff, the problem jumped out immediately. I'll tell you now that this wasn't a glamorous bug, but then most aren't. My W5100 driver code contained a convenience function called `w5100_write16(u16 address, u16 value)`, which did two 8-bit write commands to two sequential registers. Many of the registers in the W5100 are actually 16-bit values, but the command format only allows for writing 8 bits at a time, so these values are split into high and low parts. Long story short: My `w5100_write16()` function wrote to the same address twice, instead of writing to `address + 1` for the second byte. In the processed logs from my tool, this was easy to see. The arduino looked like: 132 | 133 | ``` 134 | ... 135 | 5.3645086 : [w] S0_TX_WR0 [0x0424] 0x00 136 | 5.36453178: [w] S0_TX_WR1 [0x0425] 0x3c 137 | ... 138 | ``` 139 | 140 | And mine: 141 | 142 | ``` 143 | 8.45262964: [w] S0_TX_WR0 [0x0424] 0x00 144 | 8.45286726: [w] S0_TX_WR0 [0x0424] 0x3c 145 | ``` 146 | 147 | The details aren't important here, but this was a write to the *Socket 0 transmit write pointer* register. Remember earlier when I was talking about digital logic and state machines and abstractions breaking down etc? Well, this is actually another example of that, in a way. 
The W5100 expects you to read this 16-bit register (in order!), write bytes into TX memory, and then write back to this register with the new write pointer, and finally issue a socket command to *send*. By not writing the transmit write pointer in the correct order (in fact, not writing the second byte at all), and then issuing the send, I had put the chip into yet another weird and undefined state where it just did some random stuff. And of course, no errors, no stack traces - just weird effects. 148 | 149 | 150 | 151 | I fixed the function, and my test packet showed up in full glory in wireshark. To say I was happy to see it would be an understatement. 152 | 153 | ## The moral of the story? 154 | 155 | So I sent an ethernet packet. Not exactly the world's biggest achievement when you realise how many of those are whizzing around in cables, passing through your body in the form of 2.4GHz radio waves right now, but certainly one that I was happy to call a win. 156 | 157 | Sometimes it's just fun to take the time to recount the numerous and often stupid bugs you encounter while working on a pet project - I know I love reading that from others, at least. But if I had any kind of point, it would probably be that spending the time to do things like write tools and explore the debugging space is pretty much **always worth it.** 158 | 159 | Strangely, I've found that there is a not-insignificant number of people who are *against* this - especially in a professional environment. I think this has something to do with the JIRA-fication of the development process, where all work is divided and sub-divided, and anything that does not directly check a box and produce a deliverable is wasted effort. It's disheartening that there is a whole group of people - developers and managers alike - who do not understand or see the value in *exploration as part of the process*. 
Writing a tool is often just as important and impactful as writing a test - more so even, if you don't fully understand the system you're working with yet. 160 | 161 | Debugging is, at its core, a manifestation of the scientific method; You gather data, make predictions, and test those predictions with experiments. Experiments yield new data, and you update your predictions, and hopefully come out the other side with a proper understanding. Tools like the one I wrote here are kind of akin to making plots of raw data. One good plot can tell you everything you need to know about your data. 162 | 163 | The project is making more significant progress now, and the new problems and new bugs are coming in at a higher abstraction level, which are more to do with (mis)understanding RFCs, and writing multitasking code than anything else. I might write more on the project at some point, but that'll be for another day. 164 | -------------------------------------------------------------------------------- /2024/11/26/getting-an-ip-address.md: -------------------------------------------------------------------------------- 1 | # I got an IP address 2 | 3 | This is a follow-up to a previous blog entry, [I sent an ethernet packet](https://github.com/francisrstokes/githublog/blob/main/2024/11/1/sending-an-ethernet-packet.md). 4 | 5 | I'm writing a networking stack on a microcontroller. Not for production, or to make the *fastest/smallest footprint/insert metric here*, but just to get a deeper understanding about how things work all the way at the bottom, and hopefully to be able to make a video series out of the knowledge at some point in the future. 6 | 7 | I left off having successfully sent a test ethernet packet (or more pedantically, a *frame*, as a few hackernews commenters pointed out!), by talking a simple SPI protocol to a W5100 ethernet ASIC, where the packet was placed into its internal buffer and commanded to be sent out over the physical lines. 
The end goal, of course, is to have an operating TCP/IP networking stack, and to be able to do fun things like host a webserver, and make HTTP requests. 8 | 9 | But how do you get from sending the lowest level messages over a network, MAC address to MAC address, to things like hostnames, IP addresses, ports, and reliable network transmission? We're going to cover a little bit of that in this article - primarily how I was able to use DHCP to obtain an IP address from my home network, but also how the overall architecture of the firmware is laid out, and of course how I built some tooling to help debug the whole process. 10 | 11 | ## Architecture 12 | 13 | The firmware I wrote about in the previous article had only a single goal in life: to send a single ethernet packet. The only thing it did thereafter was to spend eternity spinning in a while loop. Going from that to something complex enough to handle several protocols running side by side is not completely trivial. Some of the requirements I set down for the firmware architecture were: 14 | 15 | - It must have a single, low-level, hardware-independent facility for sending and receiving packets 16 | - Protocols (and other downstream network "users") should have independent processes 17 | - There must be a mechanism for processes to be informed about packets they care about 18 | - The low-level packet code should take care of buffering packets to be transmitted, and informing processes when their packet has been written to the interface 19 | 20 | I should note here that I'm not a networking expert at all, nor have I studied other networking stacks out there in the wild. This whole project is, like I mentioned, a way for me to dig deeper into this topic. As such, the architecture I've come up with might be way off base, or miss some important aspects. 21 | 22 | With requirements and disclaimers in mind, I chose to use FreeRTOS, a widely-used open-source real-time operating system. 
That gives me a straightforward way to create independent tasks (think very lightweight processes), and a kernel which can take care of switching them. Additionally, it provides some concurrency primitives and methods for inter-task communication. It is definitely *not* an operating system in the sense of linux or windows; By default, there are no drivers, no file systems, and no virtual memory to speak of. 23 | 24 | ![](../../../assets/w5100-project/architecture.png) 25 | 26 | This diagram shows a simplified view of the architecture design. The `Net` task takes care of the low-level packet wrangling. As you can see, it utilises an abstraction layer in the form of a "Generic Ethernet Driver", which talks to the W5100 driver. This ensures that none of the networking code is tied to the specific ethernet ASIC I happen to be using. If tomorrow I decide to use an ENC28J60 chip, or the W5500 instead, I only have to rewrite that part, along with a little glue code to conform to the generic driver interface. 27 | 28 | ### Generic Ethernet Driver 29 | 30 | The generic driver itself follows a linux-style design, where any driver implements a specific interface made of 4 functions: 31 | 32 | - `void init(void)` 33 | - `bool ioctl(u32 request, void* arg)` 34 | - `bool read(void* packet)` 35 | - `bool write(void* packet)` 36 | 37 | `init`, `read`, and `write` are probably fairly self-explanatory. `ioctl`, or IO-control, is a kind of general side-band operation, where the thing you're trying to do is assigned a request number, and the `arg` pointer could be any relevant data or structure related to that operation. For example, setting the MAC address is assigned the number `1`, and the `arg` pointer in that case is expected to be pointing at a series of 6 bytes to use as the MAC address. For a more complex operation, the `arg` pointer might point at a structure, with some members being input parameters and others being outputs. 
Either way, if the operation is successful, it returns true. 38 | 39 | The driver implements this by creating a structure filled with function pointers: 40 | 41 | ```C 42 | typedef struct NetDriverInterface_t { 43 | void (*init)(void); 44 | bool (*ioctl)(u32 request, void* arg); 45 | bool (*read)(void* packet); 46 | bool (*write)(void* packet); 47 | } NetDriverInterface_t; 48 | ``` 49 | 50 | This kind of interface turns out to be incredibly universal. Almost every device you might imagine driving requires initialisation, has some notion of inputs and outputs (writes and reads), and everything else is jammed neatly into that overflowing kitchen drawer that is `ioctl`. 51 | 52 | ## Packet transmission 53 | 54 | Tasks other than `Net` do not call into the `NetDriverInterface_t` directly, however; That could easily cause havoc and race conditions, because you'd have multiple independent tasks attempting to use the same shared resource. Instead, the `Net` module is the owner of the driver, and exposes its own API for writing a packet: `net_transmit()`. Instead of immediately transmitting the packet, it is placed into a *FreeRTOS queue*, and picked up later when the `Net` task is free. FreeRTOS's queues are the core concurrency primitive in the RTOS. Under the hood, both semaphores and mutexes are also based on queues. 55 | 56 | While this solves the concurrency problem, it introduces a new problem; Namely, the task attempting to transmit a packet doesn't know when or if that packet actually got sent out of the interface. Because the queue is not immediately processed, and because the queue may already have items inside, there is an inherent delay, which the transmitting task somehow needs to account for. More importantly, queues have a fixed size, and there may be no space for the packet at all - which means `net_transmit()` could fail. 
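
That failure mode falls straight out of how a bounded, copy-in queue behaves. The sketch below is a tiny single-threaded stand-in that I've made up for illustration, not FreeRTOS and not the project's code: `DemoQueue_t`, `DemoItem_t`, `demo_queue_send()` and `demo_queue_receive()` are all hypothetical names, and there is no locking or blocking at all. It only shows the two properties that matter here: items are copied into fixed storage, and a send fails once every slot is taken.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint8_t  u8;
typedef uint16_t u16;
typedef uint32_t u32;

// Hypothetical capacity, kept tiny so the "full" case is easy to hit
#define DEMO_QUEUE_CAPACITY 2u

// Hypothetical stand-in for a queued packet (the real one is far larger)
typedef struct DemoItem_t {
    u16 length;
    u8  data[8];
} DemoItem_t;

typedef struct DemoQueue_t {
    DemoItem_t slots[DEMO_QUEUE_CAPACITY];
    u32 head;   // index of the oldest stored item
    u32 count;  // number of items currently stored
} DemoQueue_t;

// Copies *item into the queue. Returns false when no slot is free.
bool demo_queue_send(DemoQueue_t* q, const DemoItem_t* item) {
    if (q->count == DEMO_QUEUE_CAPACITY) {
        return false;
    }
    u32 tail = (q->head + q->count) % DEMO_QUEUE_CAPACITY;
    memcpy(&q->slots[tail], item, sizeof(*item));
    q->count++;
    return true;
}

// Copies the oldest item into *out. Returns false when the queue is empty.
bool demo_queue_receive(DemoQueue_t* q, DemoItem_t* out) {
    if (q->count == 0) {
        return false;
    }
    memcpy(out, &q->slots[q->head], sizeof(*out));
    q->head = (q->head + 1) % DEMO_QUEUE_CAPACITY;
    q->count--;
    return true;
}
```

With a capacity of two, the third `demo_queue_send()` in a row returns `false` until something calls `demo_queue_receive()` and frees a slot. A real RTOS queue adds blocking with timeouts and interrupt-safe variants on top of this, but the "it can be full" part is the same.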
57 | 58 | The signature of `net_transmit()` is: 59 | 60 | 61 | ```C 62 | bool net_transmit(NetTxPacket_t* packet); 63 | ``` 64 | 65 | In other words, the user passes a reference to some packet memory they've already arranged, and if the function returns false, the operation failed. If it returns true, it only means that the packet was placed into the queue successfully - not that it has been sent to the interface. For all intents and purposes, that is the same thing though. One thing I'm learning more viscerally than ever with this project is that everything about networking is assumed to be unreliable, right up until you get to a protocol like `TCP`, where that reliability is explicitly built in. If an operation fails for some reason, the sender or the receiver needs to take care of that. 66 | 67 | Still, I wanted a way for transmitting modules to know that their packet had gone out of the interface at the very least. Taking a closer look at the `NetTxPacket_t` structure: 68 | 69 | ```C 70 | typedef struct NetTxPacket_t { 71 | volatile i32* complete; 72 | NetPacketBuffer_t packet; 73 | } NetTxPacket_t; 74 | ``` 75 | 76 | `complete` is a pointer to some memory where a completion flag will eventually be written. When someone calls `net_transmit()`, this is written with a `0`. When, eventually, the packet is taken out of the transmitting queue and sent over the interface, it is written with a `1`. I've left the possibility of writing negative values for errors, though I haven't defined any error states. There are some errors to handle in the W5100 driver, but they are still TODOs at this point. 77 | 78 | This provides the mechanism for a transmitting task to block until the packet is sent: 79 | 80 | ```C 81 | ... 
82 | // Keep trying to push the packet into the queue until successful 83 | while (!net_transmit(&tx_packet)) { 84 | vTaskDelay(1); 85 | } 86 | 87 | // Wait until the packet has either sent or errored out 88 | while (*tx_packet.complete == 0) { 89 | vTaskDelay(1); 90 | } 91 | ... 92 | ``` 93 | 94 | `vTaskDelay()` is a function provided by the FreeRTOS kernel, which puts this task to sleep for a number of "ticks", which in this case directly translate to milliseconds. It's a bit coarse, but fine for this project, where lightning fast efficiency is not required. 95 | 96 | Really sharp readers might be wondering why `complete` needs to be a pointer. Wouldn't it make more sense for it to just be an `i32`? Well, yes, except FreeRTOS queues work by copying data into the queue, not by placing a pointer to the data in the queue. This means that when the packet is eventually sent, the `Net` task could write to `complete`, but it wouldn't be the same `complete` flag the sender could check. Performance-wise, copying data around like this is not an ideal scenario, but it does make some things a lot simpler. So instead, `complete` is a pointer, and the transmitter also has the choice to fill it in with `NULL`, in which case `Net` will skip writing to it. 97 | 98 | `NetTxPacket_t` contains a member called "packet", which is a `NetPacketBuffer_t` structure. This is defined as: 99 | 100 | ```C 101 | typedef struct NetPacketBuffer_t { 102 | u16 length; 103 | u8 buffer[1500]; 104 | } NetPacketBuffer_t; 105 | ``` 106 | 107 | This is the actual packet data. Ethernet packets can be up to 1500 bytes long, and this struct also keeps track of the true size in the `length` field. 1500 bytes, or 1.46KiB might not seem like a whole lot of data, but it's a fair sized chunk on a microcontroller with just 96KiB of RAM. 
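
The point about `complete` being a pointer is easier to see in isolation. This is a minimal sketch, assuming a trimmed-down `DemoTxPacket_t` that I've invented for the purpose (the payload is dropped, and an `embedded_flag` member is added purely to show what would happen with a plain `i32`). Passing the struct by value plays the role of the queue's copy; `demo_net_task_send()` is a hypothetical stand-in for the sending side of the `Net` task.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t i32;

// Hypothetical, cut-down cousin of NetTxPacket_t (no payload)
typedef struct DemoTxPacket_t {
    volatile i32* complete;  // points at memory owned by the sender
    i32 embedded_flag;       // what a non-pointer flag would look like
} DemoTxPacket_t;

// Receives a *copy* of the packet, just like an item pulled out of a
// copy-in queue. Returns the copy's embedded flag after "sending".
i32 demo_net_task_send(DemoTxPacket_t queued_copy) {
    // Only the copy changes here; the sender's struct is untouched
    queued_copy.embedded_flag = 1;

    // The pointer value survived the copy, so this write lands in the
    // sender's memory. NULL means the sender doesn't want to be told.
    if (queued_copy.complete != NULL) {
        *queued_copy.complete = 1;
    }
    return queued_copy.embedded_flag;
}
```

After `demo_net_task_send(p)`, the sender's `p.embedded_flag` is still `0`, while the `i32` that `p.complete` points at now holds `1`. That asymmetry is the whole reason for the extra level of indirection.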
108 | 109 | ## Receiving packets 110 | 111 | The `Net` task runs in a loop, and does just three things: 112 | 113 | - Transfers any packets currently in the packet buffer 114 | - Pulls the packets out of the ethernet interface, and hierarchically matches them to endpoints in the system 115 | - Sleeps for a short amount of time (1ms) 116 | 117 | Matching packets to some kind of handler in the system could be achieved in many different ways, and indeed, the method I settled on was not the first one I implemented. The idea is to have all of the protocols that sit directly on top of ethernet implement a function with this signature: 118 | 119 | ```C 120 | bool packet_match_fn(const NetPacketBuffer_t* packet); 121 | ``` 122 | 123 | These functions can examine the packet, and decide if they consider themselves to be the "owner" of said packet. 124 | 125 | Before even consulting the protocol-level packet matching functions, `Net` already rejects any packets that are not either a broadcast (i.e. the destination MAC address is `ff:ff:ff:ff:ff:ff`), or weren't sent specifically to this device's MAC address. Some ethernet ASICs can be configured to do this on the hardware level, but this functionality unfortunately does not exist in the W5100. 126 | 127 | `Net` then loops through all the known packet matching functions, and stops when one returns true. It is the responsibility of that function to push (copy) the packet into a queue, or to delegate the decision about ownership to an even higher-level protocol. 128 | 129 | ### IP 130 | 131 | Taking a look at the architecture diagram again: 132 | 133 | ![](../../../assets/w5100-project/architecture.png) 134 | 135 | You can see that the flow for a received packet is that it first calls the `IP` module's packet matching function. That function checks if this is an IPv4 packet, and if so, whether the packet is an ethernet broadcast, an IP broadcast, or whether the IP matches what this device knows to be its own IP (we'll get to that later!). 
After that, it reads the protocol field of the IPv4 header, and delegates to the specific packet matching function of that protocol if one exists. 136 | 137 | If there is no match, it returns false, and `Net` consults the next packet matching function. Right now, there is no other protocol to check, but in the future, protocols like "ARP", the address resolution protocol, will have the next chance to examine the packet. 138 | 139 | For some context here, an IPv4 packet contains this data: 140 | 141 | ``` 142 | 0 1 2 3 143 | 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 144 | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 145 | |Version| IHL |Type of Service| Total Length | 146 | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 147 | | Identification |Flags| Fragment Offset | 148 | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 149 | | Time to Live | Protocol | Header Checksum | 150 | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 151 | | Source Address | 152 | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 153 | | Destination Address | 154 | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 155 | | Options (if IHL > 5) | 156 | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 157 | | Padding | 158 | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 159 | | Payload | 160 | | ... | 161 | | | 162 | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 163 | ``` 164 | 165 | IP, or the internet protocol, introduces the idea of "internet protocol addresses". For IPv4, these are 32-bit addresses which allow not only physically networked machines to communicate, but also whole networks to communicate with each other. IP packets can be routed around within networks, through routers and gateways, back into networks, and only ever have to resolve the physical MAC address at the very end of the journey. 
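
To make the header layout a bit more tangible, here is a rough sketch of pulling the interesting fields out of the raw bytes. It is not the project's `IP` module: `DemoIPv4Header_t` and `demo_parse_ipv4()` are names I've made up, the `data` pointer is assumed to point at the first byte *after* the ethernet header, and the header checksum is deliberately not verified. The field offsets themselves come from the IPv4 header format, where all multi-byte fields are big-endian.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t  u8;
typedef uint16_t u16;
typedef uint32_t u32;

// Hypothetical container for the handful of fields used during matching
typedef struct DemoIPv4Header_t {
    u8  version;
    u8  header_length;  // in bytes, i.e. IHL * 4
    u16 total_length;   // header + payload, in bytes
    u8  ttl;
    u8  protocol;       // e.g. 6 = TCP, 17 = UDP
    u32 src;
    u32 dst;
} DemoIPv4Header_t;

static u16 read_be16(const u8* p) {
    return (u16)(((u16)p[0] << 8) | (u16)p[1]);
}

static u32 read_be32(const u8* p) {
    return ((u32)p[0] << 24) | ((u32)p[1] << 16) | ((u32)p[2] << 8) | (u32)p[3];
}

// Returns false if the bytes can't be a valid IPv4 header.
// Note: the header checksum is NOT checked in this sketch.
bool demo_parse_ipv4(const u8* data, size_t size, DemoIPv4Header_t* out) {
    if (size < 20) {
        return false;  // shorter than the minimum header
    }

    u8 ihl = data[0] & 0x0f;  // header length in 32-bit words
    out->version = data[0] >> 4;
    if (out->version != 4 || ihl < 5) {
        return false;
    }

    out->header_length = (u8)(ihl * 4);
    if (size < out->header_length) {
        return false;  // options claimed, but not actually present
    }

    out->total_length = read_be16(&data[2]);
    out->ttl          = data[8];
    out->protocol     = data[9];
    out->src          = read_be32(&data[12]);
    out->dst          = read_be32(&data[16]);
    return true;
}
```

A matching function along the lines described above could call something like this, bail out if it returns `false`, and otherwise switch on `protocol` to decide which higher-level module gets to look at the packet next. The payload starts `header_length` bytes in, which only differs from 20 when options are present.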
166 | 167 | The [protocol](https://en.wikipedia.org/wiki/List_of_IP_protocol_numbers) field is one byte wide, and contains far more protocols than this device will ever support. The main items of interest for this project are `UDP` and `TCP`. 168 | 169 | ### UDP 170 | 171 | Going back to the packet matching process, the `IP` module would call the `UDP` ("user datagram protocol") module's packet matching function. `UDP` (like `TCP`) introduces the idea of "ports", which are one addressing layer deeper; Instead of addressing the machine (indirectly) through its IP, now the packet is addressing a particular *service* or endpoint on that computer. `UDP` is "connectionless", i.e. it doesn't perform handshaking and liveness checking between the two endpoints, and it doesn't make any guarantees about the packet reaching its destination. Finally, it is *message-oriented* as opposed to *stream-oriented*, which means that each UDP packet is assumed to be a standalone piece of data, rather than a small piece of a larger, sequential whole. 172 | 173 | While UDP is "connectionless" in the sense that it doesn't *actively* enforce rules about either endpoint, it only makes sense to match packets when there is a predefined idea of being connected to a service. 174 | 175 | For this reason, the `UDP` module exposes an API to register a "connection": 176 | 177 | ```C 178 | typedef struct UDPConnection_t { 179 | u16 src_port; 180 | u16 dst_port; 181 | u32 ip_address; 182 | u32 ip_mask; 183 | QueueHandle_t rx_queue; 184 | struct UDPConnection_t* next; 185 | } UDPConnection_t; 186 | 187 | bool udp_register_connection(UDPConnection_t* connection); 188 | ``` 189 | 190 | A module wanting to send and receive UDP packets (yes, I know they're called datagrams) creates a `UDPConnection_t` struct, and fills in the relevant details. The `QueueHandle_t` member is a FreeRTOS queue, which needs to be created separately. 
Note that `UDPConnection_t` is also a linked list node, with a pointer to the "next" `UDPConnection_t`. Internally, the `UDP` module keeps a linked list of known connections, and will manipulate the `next` pointer when `udp_register_connection` is called. This linked list is protected by a mutex, in case multiple modules try to call `udp_register_connection` (or `udp_close_connection`) at the same time. 191 | 192 | In the case of DHCP, the client (the one who is attempting to obtain an IP address in the network) is (usually) on port 68, and the server (the one handing out IP addresses) is on port 67, and the filled struct looks like: 193 | 194 | ```C 195 | UDPConnection_t dhcp_udp_connection = { 196 | .src_port = 68, 197 | .dst_port = 67, 198 | .ip_address = 0xffffffff, 199 | // We use UDP_IP_MASK_ANY here because the DHCP server will actually unicast to the offered IP address, 200 | // but we won't actually have that address until after we complete the DHCP negotiation 201 | .ip_mask = UDP_IP_MASK_ANY, 202 | 203 | // This is assigned later when the queue is created 204 | .rx_queue = NULL, 205 | .next = NULL, 206 | }; 207 | ``` 208 | 209 | DHCP is also a bit of a weird example to start this explanation with because it is a sort of chicken-and-egg situation. You don't yet have an IP address, but you need to use the IP network layer to obtain one! DHCP, which we'll get into in more detail shortly, has two kinds of "operations": `BOOTREQUEST` and `BOOTREPLY` (for historical reasons they are prefixed with "BOOT", as DHCP is an extension of a protocol called BOOTP). When a client sends a request, but does not have an IP address, the request is sent with the IP `0.0.0.0`, and is sent to the address `255.255.255.255`, which is the IP broadcast address. 
So in the above example, the IP address is set to `0xffffffff` (the u32 equivalent of `255.255.255.255`), and the `ip_mask` field is set to a special value `UDP_IP_MASK_ANY`, which indicates that incoming messages can be from any IP. Under normal circumstances, this `ip_mask` field can be used to match against one or more specific IPs to accept UDP data from. In the case of a single IP address, `ip_mask` and `ip_address` would simply be set to the same value. 210 | 211 | So with alllll of that in mind, we can unwind the stack to where the `UDP` module's packet matching function is determining the owner of the packet it has been passed by the `IP` module, which got it from `Net`. It traverses the linked list of connections, and attempts to find one where the ports and IPs match the message. If it finds one, it attempts to push into the associated queue, and returns `true` - indicating that this packet has found a home! If the queue is full, the packet is simply dropped - but this is OK, since `UDP` doesn't guarantee delivery. The client or server should simply try to send the packet again some time later if it doesn't receive a message it was expecting. 212 | 213 | Great! This gets a packet all the way out of the generic ethernet interface, and into the hands of the DHCP module, which itself can transmit UDP messages by calling into the `UDP` APIs, which call into the `IP` APIs, which eventually call into `net_transmit()`. Fantastic, so how does DHCP work? 214 | 215 | ## DHCP 216 | 217 | We're a few thousand words in, and now getting into the meat of how an IP address is actually allocated on a network. Like everything else in the world, there is a simple version (which we'll be looking at), and a plethora of far more complex, real-world versions that exist outside the home network. In principle, DHCP is a conversation that starts with a client shouting about their wish to be part of a network, and one or more DHCP servers responding like vendors at a bustling market.
The client can take their pick of offers, and make a formal request for what the server advertised. Finally the server agrees, and the IP is leased to the client for a certain time period (sometimes, rarely, indefinitely). 218 | 219 | These exchanges take place using the DHCP packet format, which is carried in a UDP payload: 220 | 221 | ``` 222 | 0 1 2 3 223 | 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 224 | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 225 | | op | htype | hlen | hops | 226 | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 227 | | xid | 228 | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 229 | | secs | flags | 230 | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 231 | | ciaddr | 232 | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 233 | | yiaddr | 234 | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 235 | | siaddr | 236 | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 237 | | giaddr | 238 | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 239 | | | 240 | | chaddr (16 bytes) | 241 | | | 242 | | | 243 | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 244 | | | 245 | | BOOTP legacy | 246 | | (192 bytes) | 247 | | | 248 | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 249 | | magic cookie | 250 | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 251 | | options in TLV format | 252 | z ..... 
z 253 | | (variable length) | 254 | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 255 | ``` 256 | 257 | - The `op` field describes whether this is a request or a reply 258 | - `htype` and `hlen` are the hardware type and length, which in our case is ethernet and `6` 259 | - `hops` counts the number of times this packet passed through a *relay agent*, which is applicable in networks where a broadcast would not reach all network nodes due to physical segmentation 260 | - `xid` is a client-provided 32-bit unique identifier. All messages exchanged during a negotiation should use this same ID 261 | - `secs` tracks how many seconds have elapsed since the beginning of the exchange 262 | - `flags` allows the client to specify if replies should be sent unicast (i.e. directly) or broadcast 263 | - `ciaddr` is the client's valid IP address. At the beginning of the process, with no IP to speak of, this field will be `0` 264 | - `yiaddr` is "your" IP address, i.e. the one assigned by the server. When the client sends requests, this field is `0` 265 | - `siaddr` is the server IP address 266 | - `giaddr` is the gateway address - the address of the first *relay agent* the DHCP message passes through. In a home network situation, this will likely just be `0` 267 | - `chaddr` is the client hardware address. It is 16 bytes long in order to accommodate hardware addressing types other than ethernet, where the hardware ID may exceed 6 bytes. For ethernet, the remaining bytes are padding and can be set to `0` 268 | - The "BOOTP legacy" is not a real field, but rather a set of non-applicable fields which are not used for DHCP. The full 192 bytes can be set to `0` 269 | - `magic cookie` is a special byte sequence that identifies this as a DHCP message (`0x63825363`) 270 | - `options` contains extra data items in a TLV format (type/length/value). Despite the name, some of these are not optional: the specific type of DHCP message, for example, is specified here.
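
To make the layout concrete, the fixed-size fields above can be written down as byte offsets from the start of the UDP payload. This is a sketch rather than the firmware's actual code: the `EX_DHCP_*` names and the `ex_dhcp_has_magic_cookie()` helper are invented here, and the offsets are relative to the DHCP message itself rather than to the whole ethernet frame.

```C
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Byte offsets of each field, relative to the first byte of the DHCP message.
// Hypothetical names, chosen only for this sketch.
enum {
    EX_DHCP_OP           = 0,   // 1 byte
    EX_DHCP_HTYPE        = 1,   // 1 byte
    EX_DHCP_HLEN         = 2,   // 1 byte
    EX_DHCP_HOPS         = 3,   // 1 byte
    EX_DHCP_XID          = 4,   // 4 bytes
    EX_DHCP_SECS         = 8,   // 2 bytes
    EX_DHCP_FLAGS        = 10,  // 2 bytes
    EX_DHCP_CIADDR       = 12,  // 4 bytes
    EX_DHCP_YIADDR       = 16,  // 4 bytes
    EX_DHCP_SIADDR       = 20,  // 4 bytes
    EX_DHCP_GIADDR       = 24,  // 4 bytes
    EX_DHCP_CHADDR       = 28,  // 16 bytes
    EX_DHCP_BOOTP_LEGACY = 44,  // 192 bytes
    EX_DHCP_MAGIC_COOKIE = 236, // 4 bytes
    EX_DHCP_OPTIONS      = 240, // variable length
};

// Returns true if the payload is long enough to hold the fixed fields and
// carries the DHCP magic cookie (0x63825363) in network byte order.
static inline bool ex_dhcp_has_magic_cookie(const uint8_t* payload, size_t length) {
    if (length < EX_DHCP_OPTIONS) {
        return false;
    }
    const uint8_t* cookie = &payload[EX_DHCP_MAGIC_COOKIE];
    return cookie[0] == 0x63 && cookie[1] == 0x82 && cookie[2] == 0x53 && cookie[3] == 0x63;
}
```

The fixed part adds up to 240 bytes, which means the options always begin at offset 240 of the DHCP message.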
271 | 272 | TLV options come in the following form: 273 | 274 | - 1 byte for the option type 275 | - 1 byte for the length of the value 276 | - 0-255 bytes for the value 277 | 278 | As an example, option type `12` is "hostname", where a client can inform the server of their own name. If the hostname were "stm32eth", the bytes would look like: 279 | 280 | ``` 281 | +-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+ 282 | | Type | Len | Value | 283 | +-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+ 284 | | 0x0c | 0x08 | 0x73 | 0x74 | 0x6d | 0x33 | 0x32 | 0x65 | 0x74 | 0x68 | 285 | +-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+ 286 | ``` 287 | 288 | The end of the options section is specified with a `0xff` byte in what would be the next "type" field. 289 | 290 | Another important option is `55`, the "parameter request list", where the client supplies a list of values they'd like to receive from the server. These are important details like the subnet mask, router address, and DNS server, but also [others](https://www.iana.org/assignments/bootp-dhcp-parameters/bootp-dhcp-parameters.xhtml) like NTP (network time protocol) servers. 291 | 292 | The DHCP client flow follows a state machine: 293 | 294 | ![](../../../assets/w5100-project/dhcp-state-machine.png) 295 | 296 | Clients start in the `Init` state, and send out a `DHCPDISCOVER` message.
In a cold start condition, where the device has no IP preferences, the packet is set up as follows: 297 | 298 | - On the ethernet level 299 | - The destination MAC address is set to broadcast 300 | - The *ether type* (the lowest level protocol differentiator) is set to IPv4 301 | - On the IP level 302 | - The source IP is set to `0.0.0.0` 303 | - The destination IP is set to `255.255.255.255` (broadcast) 304 | - The protocol is set to UDP 305 | - On the UDP level 306 | - The source port is set to `68` 307 | - The destination port is set to `67` 308 | - On the DHCP level 309 | - The `op` is set to request 310 | - The `xid` field is set to a "random" value 311 | - The `ciaddr`, `yiaddr`, `siaddr`, `giaddr` fields are all set to `0.0.0.0` 312 | - The `chaddr` is set to the device's configured MAC address (obtained with `net_get_mac_address()`) 313 | - In terms of options: 314 | - `53`: DHCP message type (`1`, DISCOVER) 315 | - `12`: Hostname ("stm32eth") 316 | - `55`: Parameter request list 317 | - Subnet mask 318 | - DNS server 319 | - Domain name 320 | - Broadcast address 321 | - Router address 322 | 323 | Note: Right now at this stage of development, the `DHCP` task registers a UDP connection, but that connection is only used to *receive* packets. Sending packets is done by constructing `NetTxPacket_t` structs and calling `net_transmit()`. In order to properly make use of the `udp_write` and `ip_transmit` APIs shown in the architecture diagram, the ARP module needs to be in place. We'll talk about ARP in the next post of this series. 324 | 325 | After sending a DHCPDISCOVER message, the client moves into the `Select` state. Since this is a broadcast message, every device (and therefore every DHCP server) on the physical network segment will receive the discover request. Any of those DHCP servers may choose to make an *offer* to the client.
The offer is sent directly to the client's MAC address, with the source IP being the server's IP, and the destination IP being the one the server is offering to the client. On the UDP level, the source and destination ports are reversed from those in the discovery message. On the DHCP level: 326 | 327 | - The `op` is set to reply 328 | - The `xid` is the one the client used 329 | - `ciaddr` is still `0.0.0.0` 330 | - `yiaddr` is set to the offered IP address 331 | - `siaddr` is set to the server's IP address 332 | - `giaddr` is set to the IP of the first relay agent, which on a simple network will not exist, and will be `0.0.0.0` 333 | - `chaddr` is set to the client's MAC 334 | - In terms of options: 335 | - `53`: DHCP message type (`2`, OFFER) 336 | - `51`: Lease time 337 | - `58`: Renewal time (50% of the lease time) 338 | - `59`: Rebinding time (87.5% of the lease time) 339 | - `1`: Subnet mask 340 | - `28`: Broadcast address 341 | - `3`: Router address 342 | 343 | A client can potentially spend some amount of time collecting up offers while in `Select`, and eventually chooses one server to request from, and moves to the `Request` state. In a simple setting, there will only be a single DHCP server, and therefore only a single offer. 344 | 345 | In the `Request` state, the client sends another broadcast packet, formally requesting what the server has offered. Most of the content of this packet is exactly the same as the one sent during `Init`, but with a few changes. The options are first cleared, and then set to: 346 | 347 | - `53`: DHCP message type (`3`, REQUEST) 348 | - `50`: Address request (the one offered by the server) 349 | - `54`: DHCP server identifier (IP address of the server) 350 | - `12`: Hostname (the client hostname, again) 351 | 352 | Finally, the client waits for an incoming DHCPACK message, which confirms the IP lease, and moves to the `Bound` state. 353 | 354 | The client can, after the *renewal time* has passed, seek to renew the lease.
For the sake of brevity, I won't go into the full details now, but it involves sending a *unicast* DHCPREQUEST directly to the server - this time filling out the `ciaddr` field. The client can renew when the *renewal time* (50% of the lease time) has elapsed. When the *rebinding time* (87.5% of the lease time) has elapsed, and the client has attempted to renew without success, as a fallback, they can *broadcast* a DHCPREQUEST with the `ciaddr` field filled in, and hope that one of the DHCP servers will renew. If that fails, then the client must relinquish the IP when the lease elapses and start the whole process anew. 355 | 356 | ## Translating to code 357 | 358 | Using the above understanding, I wrote a DHCP task that would attempt to obtain an IP from the network. Below is more or less the actual task code in the firmware, which matches very closely to the state machine diagrammed above: 359 | 360 | ```C 361 | void dhcp_task_fn(void* params) { 362 | while (true) { 363 | if (!dhcp_rx_buffer_valid && xQueueReceive(dhcp_packet_rx_queue, &dhcp_rx_buffer, pdMS_TO_TICKS(10)) == pdTRUE) { 364 | dhcp_rx_buffer_valid = true; 365 | } 366 | 367 | switch (dhcp_state) { 368 | case DHCPState_TaskInit: dhcp_state_task_init(); break; 369 | case DHCPState_Init: dhcp_state_init(); break; 370 | case DHCPState_Select: dhcp_state_select(); break; 371 | case DHCPState_TxRequest: dhcp_state_tx_request(); break; 372 | case DHCPState_Request: dhcp_state_request(); break; 373 | case DHCPState_Bound: { /* Networked! */ } break; 374 | // Renew/Rebind omitted 375 | } 376 | 377 | vTaskDelay(pdMS_TO_TICKS(DHCP_TASK_INTERVAL_MS)); 378 | } 379 | } 380 | ``` 381 | 382 | As you can see, the function contains an infinite loop, where, at the beginning, it attempts to receive a packet from the queue if there is currently no valid DHCP packet in the buffer. During `DHCPState_Init`, the task calls into `dhcp_state_init()`, which will transmit a DHCPDISCOVER message. 
I'll include the full function in its final form here for context, but don't worry too much about trying to grok every line: 383 | 384 | ```C 385 | static void dhcp_state_init(void) { 386 | dhcp_last_message_time = xTaskGetTickCount(); 387 | 388 | dhcp_setup_base_packet(); 389 | 390 | // Write the discovery payload 391 | dhcp_tx_packet.packet.buffer[DHCP_OFFSET_OP] = DHCP_OP_BOOTREQUEST; 392 | dhcp_tx_packet.packet.buffer[DHCP_OFFSET_HTYPE] = DHCP_HTYPE_ETHERNET; 393 | dhcp_tx_packet.packet.buffer[DHCP_OFFSET_HLEN] = DHCP_HLEN_ETHERNET; 394 | dhcp_tx_packet.packet.buffer[DHCP_OFFSET_HOPS] = 0; 395 | 396 | dhcp_xid = dhcp_get_xid(); 397 | 398 | net_write_u32(&dhcp_tx_packet.packet.buffer[DHCP_OFFSET_XID], dhcp_xid); 399 | net_write_u16(&dhcp_tx_packet.packet.buffer[DHCP_OFFSET_SECS], 1); // Seconds 400 | net_write_u16(&dhcp_tx_packet.packet.buffer[DHCP_OFFSET_FLAGS], 0); // Flags 401 | net_write_u32(&dhcp_tx_packet.packet.buffer[DHCP_OFFSET_CIADDR], 0); // Client IP 402 | net_write_u32(&dhcp_tx_packet.packet.buffer[DHCP_OFFSET_YIADDR], 0); // Your IP 403 | net_write_u32(&dhcp_tx_packet.packet.buffer[DHCP_OFFSET_SIADDR], 0); // Server IP 404 | net_write_u32(&dhcp_tx_packet.packet.buffer[DHCP_OFFSET_GIADDR], 0); // Gateway IP 405 | memcpy(&dhcp_tx_packet.packet.buffer[DHCP_OFFSET_CHADDR], net_get_mac_address(), 6); // Client hardware address 406 | memset(&dhcp_tx_packet.packet.buffer[DHCP_OFFSET_CHADDR + 6], 0, 10); // The rest of the client hardware address can be zeroed out 407 | 408 | // All 0 padding, BOOTP legacy 409 | memset(&dhcp_tx_packet.packet.buffer[DHCP_OFFSET_BOOTP_LEGACY], 0, 192); 410 | net_write_u32(&dhcp_tx_packet.packet.buffer[DHCP_OFFSET_MAGIC_COOKIE], DHCP_MAGIC_COOKIE); 411 | 412 | // DHCP options 413 | // Message type: DHCP Discover 414 | u16 options_offset = dhcp_write_option_tlv( 415 | &dhcp_tx_packet.packet.buffer[DHCP_OFFSET_OPTIONS], 416 | DHCP_OPTION_MESSAGE_TYPE, 417 | 1, 418 | (u8[]){ DHCP_MESSAGE_TYPE_DISCOVER } 419 | ); 420 | 421 | // 
Hostname option 422 | options_offset += dhcp_write_option_tlv( 423 | &dhcp_tx_packet.packet.buffer[DHCP_OFFSET_OPTIONS + options_offset], 424 | DHCP_OPTION_HOSTNAME, 425 | strlen(dhcp_hostname), 426 | (u8*)dhcp_hostname 427 | ); 428 | 429 | // Requested parameters option 430 | const u8 requested_parameters[] = { 431 | DHCP_PARAMETER_SUBNET_MASK, 432 | DHCP_PARAMETER_DNS_SERVER, 433 | DHCP_PARAMETER_DOMAIN_NAME, 434 | DHCP_PARAMETER_BROADCAST_ADDRESS, 435 | DHCP_PARAMETER_ROUTER, 436 | }; 437 | options_offset += dhcp_write_option_tlv( 438 | &dhcp_tx_packet.packet.buffer[DHCP_OFFSET_OPTIONS + options_offset], 439 | DHCP_OPTION_PARAMETER_REQUEST_LIST, 440 | ARRAY_SIZE(requested_parameters), 441 | requested_parameters 442 | ); 443 | 444 | dhcp_tx_packet.packet.buffer[DHCP_OFFSET_OPTIONS + options_offset] = DHCP_OPTION_END; 445 | 446 | // Calculate lengths involved at ip, udp, and packet level 447 | u32 payload_length = 240 + options_offset + 1; 448 | ip_write_total_length(dhcp_tx_packet.packet.buffer, IP_HEADER_SIZE + UDP_HEADER_SIZE + payload_length); 449 | udp_write_payload_length(dhcp_tx_packet.packet.buffer, UDP_HEADER_SIZE + payload_length); 450 | 451 | // Compute the IPv4 checksum (the UDP checksum is optional) 452 | ip_compute_and_write_checksum(dhcp_tx_packet.packet.buffer); 453 | 454 | // Finally, set the packet buffer length 455 | dhcp_tx_packet.packet.length = UDP_PAYLOAD_OFFSET + payload_length; 456 | 457 | // Transmit the packet 458 | (void)net_transmit_blocking(&dhcp_tx_packet); 459 | if (*dhcp_tx_packet.complete == NET_COMPLETE_STATUS_ERR) { 460 | // We failed to transmit the packet, try again 461 | *dhcp_tx_packet.complete = 0; 462 | return; 463 | } 464 | 465 | // Move to the next state 466 | dhcp_state = DHCPState_Select; 467 | } 468 | ``` 469 | 470 | This function is 95% just writing some prescribed data into a buffer. Note the use of the `net_write_u32()` and `net_write_u16()` utility functions. 
While most CPUs these days deal with data stored and manipulated in little-endian form, everything in the networking world is done in big-endian. The ARM Cortex-M4 chip this is running on is also little-endian, so it's necessary to translate multi-byte values before transmitting and after receiving. 471 | 472 | When the packet data is written, and all the lengths and checksums have been written, it is transmitted using `net_transmit_blocking()`, which is a wrapper function around `net_transmit()` that keeps trying to push the packet to the transmit queue until it succeeds, and then waits for the `complete` pointer to be written with a non-zero value. 473 | 474 | During `DHCPState_Select`, `dhcp_state_select()` waits for a valid received packet by checking `dhcp_rx_buffer_valid`, and returns if one is not present. Otherwise, it checks if the packet is an *offer*, notes down some of the provided values, and moves to the `DHCPState_TxRequest`. If no offer is found within a fixed time period, the state is reset to `DHCPState_Init`, and the process starts again from scratch, with a new `xid` value. 475 | 476 | `DHCPState_TxRequest` is not shown in the diagrammed state machine, but I find the code easier to reason about if the state machine is only either transmitting a packet, or assessing a received packet - but not doing both. The body of the `dhcp_state_tx_request()` function is very similar to `dhcp_state_init()`, with a few options changed as described above. 477 | 478 | Finally the code moves into the `DHCPState_Request` state, where it waits for a DHCPACK to come in, or for a timer to elapse in order to restart the process. When the DHCPACK arrives, the `DHCP` module informs the `IP` module of pertinent details like the IP address it obtained, and the router address. 479 | 480 | Great! Except, of course, this didn't work first try. And in fact, just getting to a situation where I *could* test this out reliably took quite a bit of work.
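
For reference, the two kinds of helper that `dhcp_state_init()` leans on, big-endian writes and TLV option writing, might look roughly like the sketch below. The `ex_` names are made up for this sketch. The firmware's real `net_write_u16()`, `net_write_u32()`, and `dhcp_write_option_tlv()` are not shown in this post, so only the way they are called is taken from the code above, and the bodies are an assumption.

```C
#include <stdint.h>
#include <string.h>

// Write a 16-bit value most significant byte first (network byte order).
static inline void ex_write_u16_be(uint8_t* dst, uint16_t value) {
    dst[0] = (uint8_t)(value >> 8);
    dst[1] = (uint8_t)(value & 0xff);
}

// Write a 32-bit value most significant byte first (network byte order).
static inline void ex_write_u32_be(uint8_t* dst, uint32_t value) {
    dst[0] = (uint8_t)(value >> 24);
    dst[1] = (uint8_t)((value >> 16) & 0xff);
    dst[2] = (uint8_t)((value >> 8) & 0xff);
    dst[3] = (uint8_t)(value & 0xff);
}

// Read a big-endian 32-bit value back into the CPU's native representation.
static inline uint32_t ex_read_u32_be(const uint8_t* src) {
    return ((uint32_t)src[0] << 24) | ((uint32_t)src[1] << 16) |
           ((uint32_t)src[2] << 8)  |  (uint32_t)src[3];
}

// Write one option as type, length, value. Returns the number of bytes
// written, so the caller can advance its running options offset.
static inline uint16_t ex_write_option_tlv(uint8_t* dst, uint8_t type, uint8_t length, const uint8_t* value) {
    dst[0] = type;
    dst[1] = length;
    memcpy(&dst[2], value, length);
    return (uint16_t)(2u + length);
}
```

Writing the bytes out with shifts, rather than casting the buffer to a `uint32_t*`, sidesteps both the endianness of the CPU and any alignment requirements of the destination address.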
481 | 482 | ## Debug Time 483 | 484 | How would you go about testing this? Sure, I could plug my board directly into my router, and start shooting out requests, right? This is a bad idea for a couple of reasons. I heard [Trammell Hudson](https://trmm.net/) say in a talk once that he didn't hack on his internet-connected washing machine at home because it was a "mission critical" device. Well, I probably feel even more strongly about the source of my internet connection than I do about my washing machine (which is *not* internet-connected, for the record). 485 | 486 | But even if I didn't feel that way, it's not a good strategy for testing this code because I have no *control* or *insight* into the DHCP process going on inside my router. 487 | 488 | So instead, the idea was to run a DHCP server on my own laptop, which only interacted with the USB-to-ethernet interface that I've been using for this project. If you're thinking about doing something like this yourself, these things are absolutely invaluable. They work right out of the box, and (with a little configuration) won't interfere with any other interface on your system (and so won't affect the aforementioned mission-critical internet connection). 489 | 490 | After googling around for a while, I found two candidates for running a local DHCP server: **dnsmasq** and **systemd-networkd**. 491 | 492 | I started out with systemd-networkd, which is a service that is, surprise surprise, part of the systemd software suite. It's set up and configured using a `.network` file in the `/etc/systemd/network` folder: 493 | 494 | ``` 495 | [Match] 496 | Name=enx7cc2c65485c0 497 | 498 | [Network] 499 | Address=192.168.10.1/24 500 | DHCPServer=yes 501 | 502 | [DHCPServer] 503 | PoolOffset=2 504 | PoolSize=40 505 | DefaultLeaseTimeSec=43200 506 | MaxLeaseTimeSec=43200 507 | ``` 508 | 509 | This binds the configuration to the network interface `enx7cc2c65485c0`, and gives it a fixed IP of `192.168.10.1`, with a subnet mask of `255.255.255.0`.
It enables the DHCP server and configures a few parameters, such as the pool size (how many contiguous IPs will be handed out starting at some offset). 510 | 511 | With wireshark sniffing the network traffic, I plugged in my device, and saw it spit out a DHCPDISCOVER message. That was a good sign! But as I waited, I didn't see an offer coming down the wire. I was right back in that zone of no errors, no warnings, only a thing that didn't do what I'd hoped it would. 512 | 513 | After poking around for a while, it didn't look like there was any real way to see what was going on inside systemd-networkd (so much for having a controllable black box). So I switched gears, rolled back my configuration there, and started setting up dnsmasq. 514 | 515 | From what I can tell, dnsmasq is a project that is favoured on platforms like raspberry pi. It's lightweight, not too complex, but offers fairly extensive possibilities for configuration. It's meant more as a DNS server (hence the name: dnsmasq), but includes a DHCP server as well. One of the absolute best things about this software is that the configuration file it creates by default (`/etc/dnsmasq.conf`) is fully self-documented with comments.
I didn't have to look at any online documentation in order to: 516 | 517 | - disable the DNS server 518 | - bind to my desired interface 519 | - configure the DHCP address pool range and lease time 520 | - output the lease database directly to a file in my home directory 521 | - give extra verbose information in the logs about DHCP transactions 522 | 523 | The config file, with all comments removed, looks like this: 524 | 525 | ``` 526 | port=0 527 | no-resolv 528 | interface=enx7cc2c65485c0 529 | dhcp-range=192.168.10.2,192.168.10.50,12h 530 | dhcp-leasefile=/home/francis/dnsmasq.leases 531 | log-dhcp 532 | ``` 533 | 534 | I *did* have to manually assign an IP address and subnet mask to the interface using the `ip` command: 535 | 536 | ```bash 537 | sudo ip addr add 192.168.10.1/24 dev enx7cc2c65485c0 538 | ``` 539 | 540 | Configured, I again ran wireshark, connected the device, and! Same result. No offer packet! 541 | 542 | At this point, I got a little dejected. My packet looked fine in wireshark. There were no horrible red highlights showing an invalid packet - well, not after I fixed the most blindingly obvious ones! But I'd fixed those while running systemd-networkd. I set the project down, went to take a shower, and then an idea bubbled up through my subconscious in a way that can only happen when you put the problem out of your mind! I could connect the USB-to-ethernet interface I'd been using to my laptop's dedicated PCI-E ethernet port! That way, the PCI-E interface would attempt to obtain an IP from the network it had just been plugged into, i.e. the DHCP server running from dnsmasq. 543 | 544 | I opened a terminal with tmux, and set up a split that would `watch cat dnsmasq.leases` in one view, and `watch sudo systemctl status dnsmasq` in another. Then I ran wireshark to capture traffic, and plugged the two ports into each other. Immediately, a DHCP transaction started taking place, and within a few seconds, my PCI-E interface had obtained an address!
I could see the lease handed out in the watch window, as well as a log of the transactions passing by. I'm not quite sure why the interface ends up obtaining two IP addresses, but I'm sure there is a good reason (let me know in the comments/issues). 545 | 546 | ![](../../../assets/w5100-project/connected-interfaces-dhcp.png) 547 | 548 | This felt like a real win, because it *proved* that my setup was configured correctly, and in doing so, eliminated a whole branch of possible things that could be going wrong. 549 | 550 | I compared wireshark captures of my device's failed DHCPDISCOVER to the successful one I'd seen, and there were some differences. My firmware was sending just a few options, but my PCI-E interface was sending quite a lot more. I decided to build a script in python that would let me construct a raw ethernet packet, and send it out of a specified interface. I already had the basis for this lying around [in a gist](https://gist.github.com/francisrstokes/6dacd3cfa90ec75a321c173071d4fd60) from some previous experiments. 551 | 552 | I set up a buffer using raw data I extracted from my wireshark logs. Then I set wireshark monitoring again, and ran the script. I saw the DHCPDISCOVER packet appear, but no offer in response. It took me a minute to realise my mistake. I was sending the packet out of the USB-to-ethernet interface, but the packets were not looping back, so of course I wasn't going to get a response! After changing the script to instead fire the packet out of my PCI-E interface, connecting the two interfaces together, I actually saw the thing working. I had to tweak the MAC address, of course, since the PCI-E interface was going to try to get a lease using the same one. 553 | 554 | After adding a few functions that would calculate lengths and checksums, and then write the correct values back into the buffer, I could modify the packet at will - removing or adding options, and still having a valid packet to send.
Slowly and methodically, I morphed the packet to exactly what I was sending from the firmware. And the crazy thing? It actually worked. I was getting an offer from the DHCP server. 555 | 556 | I was kind of stumped though, but I could also feel that I was getting really close. With two wireshark windows open side by side, I compared every single byte and field in the two packets. Halfway through the IP section, I realised my mistake: The IP checksum in my firmware's packet was 0 - empty. Somehow, I had forgotten the line: 557 | 558 | ```C 559 | ip_compute_and_write_checksum(dhcp_tx_packet.packet.buffer); 560 | ``` 561 | 562 | An oversight, no doubt. What confused me was that wireshark hadn't shown anything wrong with the packet at all! As it turns out, wireshark does *not validate checksums* by default. There are good reasons for that - if you're capturing a huge amount of traffic on a network for analysis, you don't necessarily care in that moment if everything is well-formed. 563 | 564 | I still learned a bunch of useful things through this exercise. With that one simple line added in the firmware, my DHCPDISCOVER message was getting a DHCPOFFER in return! Another 15 minutes of testing and tweaking, and I was the proud owner (well, leaser) of local network IP address **192.168.10.38**! 565 | 566 | ## Wrapping up 567 | 568 | This post (and of course the work itself) has a fairly different character to the previous one: fewer soldering irons and timing issues, more "high level" firmware considerations, and understanding layers of the stack. Personally I love that feeling of moving from a beginner's understanding, to one where I have enough overview to see how deep everything really goes. I know I'm only just beginning to scratch the surface, but I feel comfortable enough now to search around and dig in to details.
569 | 570 | I'm also learning just how powerful and versatile the networking tools on linux really are (and I haven't even touched eBPF on this project yet). Going forward, I expect this is going to be an area where I'm going to have to sharpen up quite a bit. The next layers of the stack are going to involve communicating with other devices on the network, including gateways and routers. If I want to be able to test and debug with the same level of control I've enjoyed here, I'm going to need to figure out how to host or simulate those in my isolated laptop environment. If you know anything about that, drop me a comment, or pass by on the [Low Byte Productions discord server](https://discord.gg/FPWaVgk). 571 | 572 | I hope you've enjoyed the journey. Next time we'll be climbing a little higher in the stack, and hopefully establishing an actual, two way connection! 573 | -------------------------------------------------------------------------------- /2024/5/10/cordic.md: -------------------------------------------------------------------------------- 1 | # Why the CORDIC algorithm lives rent-free in my head 2 | 3 | *This post is an adaptation of a [twitter thread](https://twitter.com/fstokesman/status/1787949934123049021) I put together a few days ago.* 4 | 5 | 6 | 7 | CORDIC is an algorithm for computing trig functions like `sin`, `cos`, `tan` etc on low powered hardware, without an FPU (i.e. no floating point) or expensive lookup tables. In fact, it reduces these complex functions to simple additions and bit shifts. 8 | 9 | I'll cut right to the chase and tell you *why* I love this algorithm so much, and then we'll dive into the details of exactly how it works. Essentially, the actual operations of the algorithm are incredibly simple - just shifts and adds, as I mentioned before - but it does this by combining vector math, trigonometry, convergence proofs, and some clever computer science. 
To me, it's what people are talking about when they describe things of this nature as "elegant". 10 | 11 | Let's start with an obvious point: You don't need this if you're working on high powered hardware. This technique is applicable for embedded environments; especially less capable microcontrollers and FPGAs. Even then, it's possible that more capable hardware/peripherals will be available which would be "faster", though speed is not the only measure of usefulness. 12 | 13 | ## Avoiding floating point 14 | 15 | *(if you're already familiar with fixed-point, you can safely skip this section)* 16 | 17 | You might be wondering how we are able to avoid floating point, when functions like `sin(x)` produce values between -1.0 and 1.0? Well, floating point is not the only way of representing rational numbers. In fact, before IEEE 754 became the popular standard that it is today, *fixed point* was used all the time (go and ask any gamedev who worked on stuff between 1980 and 2000ish and they'll tell you all about it). 18 | 19 | In fact, I got nerd-sniped into this whole CORDIC investigation after listening to [Dan Mangum's fantastic Microarch Club podcast](https://twitter.com/MicroarchClub/status/1759606520713453630), where Philip Freidin dropped the spicy hot-take that "Floating point is a crutch", and that using it might be a sign that you don't *really* understand the algorithm you're working on. Of course I should mention this was more in the context of custom ASICs rather than your run-of-the-mill webapp, but the quote really stuck with me. 20 | 21 | So how does fixed point work? Well you take an integer type like `int32_t`, and say the top 16 bits are the whole part of the number, and the bottom 16 bits are the fractional part. You could divide the number up differently (e.g. 8 bits for the whole part and 24 for the fractional), but we'll use 16/16 as an example here. 22 | 23 | 24 | 25 | That gives a range of around `-32768.0` to `32767.99998`.
We've *fixed* the radix point at the 16th bit, though again, we could have put it anywhere. Moving the point allows us to trade off for precision where it makes sense (i.e. more bits for whole numbers, or more bits for fractional representation). 26 | 27 | Something worth noting here is that the number is still an `int32_t` - we the programmers have assigned the extra meaning here (though this is also true of literally every data type in computing - there are only bits in the end!). 28 | 29 | 30 | 31 | How do we get a number into this format? Well, we've got 16 bits of fractional precision, so take a float like `42.01`, and scale it up by `(1 << 16)`. That gives us `2753167` when cast into an `int32_t`. If we want to go from fixed point back to floating point, we just do the opposite. `2753167 / (1 << 16)` gives us `~42.0099945`, which is very close to `42.01`. 32 | 33 | ```C 34 | #define SCALING_FACTOR (16) 35 | 36 | static inline int32_t fixed_from_float(float a) { 37 | return (int32_t)(a * (float)(1 << SCALING_FACTOR)); 38 | } 39 | 40 | static inline float fixed_to_float(int32_t a) { 41 | return (float)a / (float)(1 << SCALING_FACTOR); 42 | } 43 | ``` 44 | 45 | We could also forgo floating point altogether and encode a number like `1.5` manually. The whole part is just `1`, so we shift that up (`(1 << 16)`), and the fractional part is one half, which in 16 fractional bits is `0x8000` (half of `0x10000`). That gives us `98304` in decimal. 46 | 47 | Operations like addition and subtraction Just Work™ - assuming you're using the same scaling factor for whichever numbers you're operating on. It is possible to mix and match scaling factors, but it increases the complexity. 48 | 49 | Multiplication is only marginally trickier. Multiplying the two fixed point numbers together essentially scales everything up by the scaling factor. This can be resolved by just shifting the result back down.

```C
static inline int32_t fixed_multiply(int32_t a, int32_t b) {
    // Widen to 64 bits so the doubly-scaled intermediate result can't overflow
    return (int32_t)(((int64_t)a * (int64_t)b) >> SCALING_FACTOR);
}
```

Division is basically the same story, except in reverse. There's a trick to squeeze out some extra precision by prescaling the dividend by the scaling factor, and then dividing by the divisor.

```C
static inline int32_t fixed_divide(int32_t a, int32_t b) {
    // Prescale by multiplying rather than shifting: left-shifting a negative value is undefined in C
    return (int32_t)(((int64_t)a * ((int64_t)1 << SCALING_FACTOR)) / (int64_t)b);
}
```

OK we can do basic operations, but what if I need something more complex, like I don't know, a trig function? This is where CORDIC comes in.

## The CORDIC algorithm

CORDIC stands for "co-ordinate rotation digital computer", and was cooked up back in the mid 50s (though the general algorithm has been known to mathematicians for hundreds of years). The core idea is that we can rotate a vector around a unit circle by progressively smaller and smaller angles, and the vector components will end up being the sine and cosine of the angle we're interested in.

It's sort of like a binary search: You move towards the target angle by some large angle and check if you're ahead or behind, and then move by half that angle either clockwise or anticlockwise. This process repeats with smaller and smaller angles until the result converges.

If you've worked with these kinds of operations before, you'll know that rotating a vector involves multiplying it with a matrix consisting of sines and cosines of the angle to be rotated to. That seems counterproductive, since those are the functions we're trying to compute!

$$
\begin{bmatrix}
x' \\
y'
\end{bmatrix} = \begin{bmatrix}
\cos(\theta) & -\sin(\theta) \\
\sin(\theta) & \cos(\theta)
\end{bmatrix} \begin{bmatrix}
x \\
y
\end{bmatrix}
$$

We'll put that aside for a second, and get a big picture overview before solving this problem. Now, it's fairly obvious that rotating by say `22.5˚` is the same as rotating by `45˚` and then `-22.5˚` - i.e. we can break up a rotation into smaller parts, with both positive and negative components.

Let's say that we have a maximum rotation of `90˚` (π/2 radians), and we're trying to figure out `sin(0.7)` (about `40˚`). Starting with a vector `(1, 0)` and a target of `0.7` radians, we rotate `0.7853` rads (`45˚`) anti-clockwise.

Target now becomes `0.7 - 0.7853 = -0.0853`. Since it's negative, we now rotate clockwise by `0.3926` rads (`22.5˚`). Target becomes `-0.0853 + 0.3926 = 0.3073`, which is positive, so the next rotation will be anti-clockwise by `0.1963` rads (`11.25˚`).

If we continue this process for a total of 16 iterations, the vector lines up almost perfectly with the original target angle. The `y` value of the vector is ~= `sin(a)`, while `x` ~= `cos(a)`! This is how CORDIC works; we rotate a vector around, and the state we keep is an approximation of various trigonometric functions.

With some understanding in hand, we can return to the issue of, well, rotations actually requiring the functions we're trying to compute! We can use trigonometry to simplify the matrix.

$$
\cos(\theta) = \frac{1}{\sqrt{1 + \tan^2(\theta)}}
$$

$$
\sin(\theta) = \frac{\tan(\theta)}{\sqrt{1 + \tan^2(\theta)}}
$$

$$
\begin{bmatrix}
x' \\
y'
\end{bmatrix} = \cos(\theta)\begin{bmatrix}
1 & -\tan(\theta) \\
\tan(\theta) & 1
\end{bmatrix} \begin{bmatrix}
x \\
y
\end{bmatrix}
$$

We have a few constant ones now, but we still have the `tan(a)`, plus the `cos(a)` out front. Let's ignore the `cos(a)` and focus on getting rid of `tan(a)`. As you saw when we ran through the algorithm, we're always rotating by a total of `~90˚`: First by `45˚`, then `22.5˚`, then `11.25˚`, and so on. Since we're doing this a fixed number of times, we can just precompute those values, and put them in a table. You might be thinking: *"You said there wouldn't be any tables!"*. Well, no. I said there wouldn't be any *expensive* tables. This table, in our case, will only contain 16 `uint32_t`s - a whopping 64 bytes! Even the most stripped down embedded projects can *usually* afford that. (In contrast, an *unoptimised* table for `sin(x)` that contains 4096 entries covering values from -1 to 1 would need 16KiB - and that's pretty poor precision!)

$$
\begin{bmatrix}
x' \\
y'
\end{bmatrix} = \cos(\theta)\begin{bmatrix}
1 & -\text{table}[i] \\
\text{table}[i] & 1
\end{bmatrix} \begin{bmatrix}
x \\
y
\end{bmatrix}
$$

That means our rotation matrix now only contains constants! We do however still have that `cos(a)` term. In fact, every iteration brings its own new `cos(a)` term. But because of algebra, we can simply multiply all those terms together and apply them at the end.

$$
\cos(\theta_0) \cdot \cos(\theta_1) \cdot \cos(\theta_2) \cdot ... \cdot \cos(\theta_N)
$$

Still, that's not great. But!
No matter whether we take positive or negative steps, or the number of iterations, this multiplied out series of cosines actually converges to a constant value: `~0.6366`. All we need to do is to multiply out by this value after all iterations.

$$
0.6366 \approx \cos(\pm45^\circ) \cdot \cos(\pm22.5^\circ) \cdot \cos(\pm11.25^\circ) \cdot ... \cdot \cos(\pm\theta_N)
$$

So that gives us only multiplications by constants over a number of iterations! Not bad. But didn't I say that CORDIC only used bit shifts and addition? For that, we need to go a little deeper into the rabbit hole.

## Shifts and Adds

What if the angles we plugged into `tan(a)` could instead be strategically chosen so that the result would always be an inverse power-of-2? This would be great, since multiplying or dividing by a power-of-2 is just a left or right shift for integers.

Well, the `atan(x)` (arc-tangent or inverse tangent) function can do that for us. We can build a new 16-entry table, where each value is `atan(2**-i)`, for i=0 to 15. The actual rotation values for each iteration are now (`45˚`, `26.565˚`, `14.036˚`, `7.125˚`, etc).

It doesn't actually halve the angle each time, but as it turns out: using these angles, the process will *still* converge on the correct result! Now all those multiplications by `tan(a)` have become bit shifts by the iteration number.

We still need to recompute our constant for the `cos(a)` terms. That now comes out to be around `0.60725`, which would be converted to the fixed point number `39796`.
And! It turns out there's a trick that means we don't even need to multiply by this value at the end. When we initialise the vector, we set `x` to this constant instead of 1.

$$
0.60725 \approx \cos(\pm\arctan(2^{0})) \cdot \cos(\pm\arctan(2^{-1})) \cdot \cos(\pm\arctan(2^{-2})) \cdot ...
\cdot \cos(\pm\arctan(2^{-N}))
$$

So now the CORDIC algorithm looks like this:

Precompute a table for `tan(a)`, where each entry is `atan(2**-i)`. These values are, of course, converted to fixed point, so: `atan(2**-i) * (1 << 16)`

Then, when we want to compute a sine or a cosine, we take the angle (e.g. `0.9152`), convert it to fixed point: `0.9152 * (1 << 16) = 59978`

Then set up the initial parameters:

```
x = 39796
y = 0
z = 59978
```

The `z` parameter here is not part of the vector, but rather tracks our target angle over time. The sign of this parameter determines if we rotate clockwise or anti-clockwise.

With the parameters set up, each iteration looks like this (in pseudocode):

```python
if z >= 0:
    x_next = x - (y >> i)
    y_next = y + (x >> i)
    z -= table[i]
else:
    x_next = x + (y >> i)
    y_next = y - (x >> i)
    z += table[i]
x = x_next
y = y_next
```

Now we can follow a few iterations through, and see the algorithm converge on the correct sine and cosine values. Values in parentheses are fixed point.

During the first iteration, `z` was positive, so the vector is rotated anti-clockwise by `~0.785` rads. Note that the magnitude of the vector increased.

In the second iteration, `z` was still positive, so again the vector is rotated anti-clockwise, by `~0.464` rads, though this time it overshot the mark. The magnitude of the vector is almost one now - that's the `cos(a)` product term starting to converge after we set the initial `x` value!

On iteration 3, `z` was negative, so the vector is rotated clockwise by `~0.245` rads. It's clearly starting to creep up on that mark, and you can see that with just a handful of iterations, we'd be able to get a fairly close approximation!

On iteration 4, `z` was again negative, so clockwise rotation by `~0.124` rads. Now that the angular change is getting pretty small, and the vector is very close to the actual result, the rotations ping back and forth, getting closer and closer to the real value.

Skipping forward to the last iteration, `y` now contains a very accurate approximation for `sin(0.9152)` - with an absolute deviation of just `0.00000956`. The cosine value (in `x`) deviation is slightly higher, at `0.0000434`, but still pretty good!

## Wrapping up

There is _a lot_ more to CORDIC than this, which I may cover in a future post. For instance, I didn't mention the special considerations you have to make if the angle of interest is outside of the first or fourth quadrant of the unit circle. I also didn't talk about how, with a few modifications, CORDIC can be used to compute many other functions, including `tan`, `atan`, `asin`, `acos`, `sinh`, `cosh`, `tanh`, `sqrt`, `ln`, `e^x`. Related algorithms also exist, such as [BKM](https://en.wikipedia.org/wiki/BKM_algorithm), designed specifically for computing logs and exponentials.

I'm planning on covering this in some detail on the [Low Byte Productions YouTube channel](https://www.youtube.com/@LowByteProductions?subscribe), so follow me there if this kind of thing is something you'd like to learn more about.
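
For anyone who wants to poke at the algorithm described in this post, here is a minimal sketch of the whole thing in C, assuming the same 16.16 fixed point format and 16 iterations used throughout. The names `cordic_sincos`, `atan_table` and `CORDIC_GAIN` are made up for this example, the table entries are `atan(2**-i) * (1 << 16)` rounded to the nearest integer, and the input angle is assumed to already sit in the first or fourth quadrant. It also assumes that `>>` on a negative `int32_t` is an arithmetic shift, which is implementation-defined in C but true on mainstream compilers.

```c
#include <stdint.h>

#define CORDIC_ITERATIONS 16
#define CORDIC_GAIN       39796 /* ~0.60725 in 16.16 fixed point */

/* atan(2^-i) for i = 0..15, scaled by (1 << 16) and rounded */
static const int32_t atan_table[CORDIC_ITERATIONS] = {
    51472, 30386, 16055, 8150, 4091, 2047, 1024, 512,
    256,   128,   64,    32,   16,   8,    4,    2
};

/*
 * angle is in radians, 16.16 fixed point, roughly within [-pi/2, pi/2].
 * Results are 16.16 fixed point approximations of sin(angle) and cos(angle).
 */
void cordic_sincos(int32_t angle, int32_t *sin_out, int32_t *cos_out) {
    int32_t x = CORDIC_GAIN; /* pre-scaled so no final multiply is needed */
    int32_t y = 0;
    int32_t z = angle;       /* remaining angle still to rotate through */

    for (int i = 0; i < CORDIC_ITERATIONS; i++) {
        /* Take both shifted values before updating either component */
        int32_t x_shifted = x >> i;
        int32_t y_shifted = y >> i;

        if (z >= 0) {
            /* rotate anti-clockwise */
            x -= y_shifted;
            y += x_shifted;
            z -= atan_table[i];
        } else {
            /* rotate clockwise */
            x += y_shifted;
            y -= x_shifted;
            z += atan_table[i];
        }
    }

    *sin_out = y;
    *cos_out = x;
}
```

Calling `cordic_sincos(59978, &s, &c)` (that's `0.9152` radians in fixed point) should leave `s` and `c` within a few counts of the fixed point encodings of `sin(0.9152)` and `cos(0.9152)`; dividing each by `(1 << 16)` gets you back to the familiar -1.0 to 1.0 range.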

--------------------------------------------------------------------------------
/2024/5/29/fast-inverse-sqrt.md:
--------------------------------------------------------------------------------
# Everything I Know About The Fast Inverse Square Root Algorithm

The **fast inverse square root algorithm**, made famous (though not invented) by programming legend John Carmack in the Quake 3 source code, computes an inverse square root $\frac{1}{\sqrt{x}}$ with a bewildering handful of lines that interpret and manipulate the raw bits of a float. It's *wild*.

```C
float Q_rsqrt(float number) {
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = *(long*)&y;                            // evil floating point bit level hacking
    i  = 0x5f3759df - ( i >> 1 );               // what the fuck?
    y  = *(float*)&i;
    y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
    // y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

    return y;
}
```

In this article, we'll get into what's actually happening at the mathematical level in quite a bit of detail, and by the end, with a little persistence, you'll come away actually *understanding* how it works.

I'm not the first to write about this algorithm, and I surely won't be the last, but my aim is to show *every* step of the process. A lot of really fantastic resources out there still skip over steps in derivations, or fail to highlight apparently obvious points. My goal is to remove any and all magic from this crazy algorithm.

It's important to note that this algorithm is very much *of its time*. Back when Quake 3 was released in 1999, computing an inverse square root was a slow, expensive process. The game had to compute hundreds or thousands of them per second in order to solve lighting equations, and other 3D vector calculations that rely on normalization.
These days, on modern hardware, not only would a calculation like this not take place on the CPU, even if it did, it would be fast due to much more advanced dedicated floating point hardware.

## The algorithm

```C
float Q_rsqrt(float number) {
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = *(long*)&y;                            // evil floating point bit level hacking
    i  = 0x5f3759df - ( i >> 1 );               // what the fuck?
    y  = *(float*)&i;
    y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
    // y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

    return y;
}
```

This is the code, more or less exactly as it appears in the Quake 3 source - including the comments. Personally I think *"evil floating point bit level hacking, what the fuck"* is a fantastic explanation, but I do want to dig in quite a bit further.

One of the key ideas behind this algorithm, and the reason it works, is that the raw bit pattern of a float, when interpreted as a 32-bit signed integer, essentially approximates a scaled and shifted `log2(x)` function.

Logarithms have a bunch of rules, properties, and identities that can be exploited to make computing the inverse square root easy for a computer, using only simple operations like adds and shifts (though there are some supporting floating point multiplications, which we'll talk about later).

In order to make sense of what it even means to interpret the bit pattern of a float, we need to look at how floats are represented in memory, and how the "value" of a float is derived from that representation.

## 32-Bit Floats: Representation

An IEEE-754 32-bit float can be regarded as a struct, which holds 3 members.
Using C's bit-field notation here:

```C
// Conceptual layout only: the actual ordering of bit-fields is implementation-defined
struct float_raw {
    uint32_t mantissa : 23;
    uint32_t exponent : 8;
    uint32_t sign     : 1;
};
```

**Sign:** 1 bit which indicates whether the number is positive or negative
**Exponent**: 8 bits which are used to dictate the range that this number will fall into
**Mantissa**: 23 bits which linearly specify where exactly in the range this number lives

The following equation shows how the actual numerical value $N$ is conceptually derived from the 3 integer parts, where $S$ is the sign bit, $E$ is the exponent value, and $M$ is the mantissa.

$$
N = (-1)^S \times 2^{E-127} \times (1 + \frac{M}{2^{23}})
$$

Or, if we break some of the variables out:

$$
m = \frac{M}{2^{23}}
$$

$$
B = 127
$$

$$
e = E-B
$$

$$
N = (-1)^S \times 2^e \times (1 + m)
$$

Note the little trick to get the sign bit from a 0 or a 1 into a 1 or a -1. Also notice that instead of simply multiplying by $m$, we multiply by $1+m$. This ensures that when $m$ is 0, we get $2^e$, and as $m$ approaches 1, we approach $2^{e+1}$ (i.e. the full range).

Let's take an example like the (approximate) number `-1.724`. Its underlying representation would look like this:

One interesting thing is that the exponent is actually stored in a biased format. The actual exponent value is $e = E - 127$. This allows two positive floating point numbers to be compared as if they were unsigned integers, which is a rather large benefit when it comes to building hardware accelerated floating point units.

The next complexity is that an exponent $E$ of all zeros has a special meaning.
All the numbers in this range are known as "sub-normals", and are represented by a slightly modified equation:

$$
N = (-1)^S \times 2^{-126} \times m
$$

The exponent is set to -126. The mantissa value doesn't have an added value of one (in fact it's implicitly $0 + m$), so the range actually represents 0 to just less than $2^{-126}$. Without this, it would be impossible to represent 0 or the very small numbers around 0, which can cause underflow errors when calculations on small numbers result in one of these impossible values.

When the exponent $E$ is all ones, then the floating point value is one of the two other (quite famous) special types: `NaN` and `Infinity/-Infinity`. If $E = 255$, and $M = 0$, then the number represents an infinity, with the sign bit signifying positive or negative.

If $E = 255$, and $M \neq 0$, then the value is `NaN` (not a number), which is used to signify when an illegal operation has taken place, like $0/0$.

## 32-Bit Floats: Interpreting the bits

Of course, typically this internal representation is completely irrelevant to the programmer; They can just perform calculations and get results. Indeed, William Kahan notes in his [1998 presentation "How Java’s Floating-Point Hurts Everyone Everywhere"](https://people.eecs.berkeley.edu/~wkahan/JAVAhurt.pdf):

> Error-analysis tells us how to design floating-point arithmetic, like IEEE Standard 754, moderately tolerant of well-meaning ignorance among programmers

The idea being that "numerical sophistication" is not necessary to make effective use of floating point.

But that said, an intimate familiarity with the format can lead to some clever designs.
We looked at how the integer parts of a float translate to a decimal number, but we can also talk about those same parts in terms of an integer representation in mathematical form:

$$
I_x = 2^{31}S + 2^{23}E + M
$$

That $2^{23}$ term is quite important, so let's break it out into its own variable:

$$
L = 2^{23}
$$

$$
I_x = 2^{31}S + LE + M
$$

And since we're mainly going to be talking about taking square roots of numbers, we can assume the sign is positive ($S = 0$), and use the simpler form:

$$
I_x = LE + M
$$

If you take a closer look at the raw bits of a float, you can make some interesting observations.

The first is that every *range* - that is the set of numbers which can be represented by any given exponent value - has half the precision of the range before it.

For example, take the exponent value $E = 127, e = E - B = 0$, which represents the range of representable numbers $\pm[1, 2)$. There are 8388607 (`(1 << 23) - 1`) distinct steps from 1 to just below 2.

Contrast that with exponent value $E = 128, e = E - B = 1$, which represents the range of representable numbers $\pm[2, 4)$. It has the same 8388607 distinct steps, but it has to cover twice the distance on the number line.

## Raw bits as logarithms

This relationship is *logarithmic*. If you take a series of evenly-spaced floating point numbers - say 256 of them - starting at 0, increasing by 0.25 each time, and interpret the bit pattern as an integer, you get the following graph:

Now if we plot the result of taking `log2(x)` of those same 256 float values, we get this curve.

Obviously the actual values on the graphs are wildly different, and the first one is much more *steppy*, but it's clear that the first is a close approximation of the second.
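
A quick way to convince yourself of that logarithmic behaviour without plotting anything is to look at what happens to the raw bits when a value doubles. The sketch below is illustrative only: `float_bits` and `doubling_step` are names invented for this example, and it assumes `float` is an IEEE-754 32-bit type. It uses `memcpy` to grab the bit pattern, which sidesteps the pointer-cast aliasing trick used in the Quake code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy the raw bit pattern of a float into an integer (no numeric conversion) */
static uint32_t float_bits(float f) {
    uint32_t bits;
    assert(sizeof bits == sizeof f); /* this sketch expects a 32-bit float */
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* How far the integer interpretation moves when a (normal) float is doubled */
static uint32_t doubling_step(float x) {
    return float_bits(2.0f * x) - float_bits(x);
}
```

For any normal, positive `x` whose double is still finite, `doubling_step(x)` comes out as `1 << 23`: doubling the value adds a *constant* to the integer interpretation, no matter how big or small `x` is. Turning multiplication into addition is exactly what a logarithm does, and that constant is the $2^{23}$ that keeps showing up in the equations here.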

The first graph is what you might call a *piecewise linear approximation*, which has been scaled and shifted by a specific amount. Perhaps unsurprisingly, the amount it's scaled and shifted by is related to the structure of a float!

$$
\log_2(x) \approx \frac{I_x}{L} - B
$$

Here, $I_x$ is the raw bit pattern of a float in integer form. That is divided by the size of the mantissa, and the exponent bias is subtracted away. If we plot this directly against `log2(x)`, we get:

Again, not a perfect mapping, but a pretty good approximation! We can also sub in the floating point terms, assuming a positive sign bit and a *normal* number:

$$
\frac{I_x}{L} - B = \frac{LE_x + M_x}{L} - B = e_x + m_x
$$

Forgetting the integer representation for just a second, the log of a floating point number alone would be expressed as:

$$
\log_2(1 + m_x) + e_x
$$

Since we already know that the integer conversion is a *linear approximation*, we can make this approximate equivalence:

$$
\log_2(1 + x) \approx x + \sigma
$$

The sigma ($\sigma$) term is essentially a fine adjustment parameter that can improve the approximation. To make it really concrete, the $x$ term here will always be a number in the range $[0,1)$, and represents a position in the exponent range *linearly*.

With all of that in mind, we can focus our attention back on the thing we're (now indirectly) attempting to compute: $\frac{1}{\sqrt{x}}$.

When we work with the raw bits of a float, we are essentially operating on a logarithm of that value. Logarithms have been carefully studied for a long time, and they have many known properties and identities.
For example:

$$
\log_2(x^y) = y \times \log_2(x)
$$

We can also note that:

$$
\sqrt{x} = x^{0.5}
$$

Since we're asking for an answer to the question:

$$
y = \frac{1}{\sqrt{x}}
$$

which we can reformulate as

$$
y = \frac{1}{x^{0.5}}
$$

and even simpler:

$$
y = x^{-0.5}
$$

We can take a look at our log formulas from before, and state that:

$$
\log_2(\frac{1}{\sqrt{x}}) = \log_2(x^{-0.5}) = -0.5 \times \log_2(x)
$$

Plugging in the floating point values now:

$$
\log_2(1 + m_y) + e_y \approx -0.5 \times (\log_2(1 + m_x) + e_x)
$$

$$
m_y + \sigma + e_y \approx -0.5 \times (m_x + \sigma + e_x)
$$

And then getting the floating point constants back into their integer component form:

$$
\frac{M_y}{L} + \sigma + E_y - B \approx -0.5 \times (\frac{M_x}{L} + \sigma + E_x - B)
$$

We can do some algebra on this expression to turn it into one where both sides contain something that looks like a raw floating point bit pattern in integer form ($LE + M$).
I'm leaving every step in for clarity, though it's the last line here which is important:

$$
\frac{M_y}{L} + \sigma + E_y \approx -0.5 \times (\frac{M_x}{L} + \sigma + E_x - B) + B
$$

$$
\frac{M_y}{L} + E_y \approx -0.5 \times (\frac{M_x}{L} + \sigma + E_x - B) + B - \sigma
$$

$$
\frac{M_y}{L} + E_y \approx -\frac{1}{2}(\frac{M_x}{L} + \sigma + E_x - B) + B - \sigma
$$

$$
\frac{M_y}{L} + E_y \approx -\frac{1}{2}(\frac{M_x}{L} + E_x - B) + B - \frac{3}{2}\sigma
$$

$$
\frac{M_y}{L} + E_y \approx -\frac{1}{2}(\frac{M_x}{L} + E_x) - \frac{3}{2}(\sigma - B)
$$

$$
L(\frac{M_y}{L} + E_y) \approx L(-\frac{1}{2}(\frac{M_x}{L} + E_x) - \frac{3}{2}(\sigma - B))
$$

$$
L(\frac{M_y}{L} + E_y) \approx -\frac{1}{2}L(\frac{M_x}{L} + E_x) - \frac{3}{2}L(\sigma - B)
$$

$$
LE_y + M_y \approx -\frac{1}{2}(LE_x + M_x) - \frac{3}{2}L(\sigma - B)
$$

$$
LE_y + M_y \approx -\frac{1}{2}(LE_x + M_x) + \frac{3}{2}L(B - \sigma)
$$

That is quite a mouthful - although all of the operations performed here are simple enough. With all the variable swapping done, and both sides containing groups that include proper, honest-to-goodness integer part floating point representations, we can group them back up and get:

$$
I_y \approx -\frac{1}{2}I_x + \frac{3}{2}L(B - \sigma)
$$

This is quite a significant moment. On the left hand side, we've got the integer bit pattern of the result, which stands in for the *value* $\log_2(\frac{1}{\sqrt{x}})$, and on the right, we've got a simple operation on the integer interpretation of the floating point input (multiplying by negative one half), plus a constant term, made up of constants related to floating point representation (as well as the sigma tuning parameter). *This is the famous line*:

```C
i  = 0x5f3759df - ( i >> 1 );               // what the fuck?
```

A bit shift to the right multiplies by $\frac{1}{2}$, and the result is subtracted from the constant `0x5f3759df`. That hex constant is the $\frac{3}{2}L(B - \sigma)$ term, but where exactly does `0x5f3759df` come from? Assuming a sigma value $\sigma = 0$, we can compute:

$$
\frac{3}{2}L(B - \sigma) = \frac{3}{2}2^{23} \times 127 = 1598029824
$$

`1598029824` in hexadecimal is `0x5f400000`, which, as you'll note, is close to, but *not quite* the magic constant from Quake. It's off by `566817`, and we can use this information to compute the actual sigma value used in the game:

$$
\frac{3}{2}2^{23} \times 127 - \frac{3}{2}2^{23}(127 - \sigma) = 566817
$$

$$
\frac{3}{2}(2^{23} \times 127 - 2^{23}\times127 - 2^{23}(- \sigma)) = 566817
$$

$$
\frac{3}{2}(-2^{23}(- \sigma)) = 566817
$$

$$
-2^{23}(- \sigma) = \frac{566817}{1.5}
$$

$$
2^{23}\sigma = 377878
$$

$$
\sigma = \frac{377878}{2^{23}}
$$

$$
\sigma = 0.04504656
$$

That sigma value was chosen by someone to optimise the approximation, but interestingly, it isn't actually the *optimal* value (more on that later), *and* it isn't actually known who came up with it! I've left all the math in so as to remove any possibility of this being a "magic" constant; It's really anything but! In C:

```C
#include <math.h>
#include <stdint.h>

int32_t compute_magic(void) {
    double sigma = 0.0450465;
    double expression = 1.5 * pow(2.0, 23.0) * (127.0 - sigma);
    int32_t i = (int32_t)expression; // plain numeric conversion, truncating the fraction
    return i;
}

// -> 0x5f3759df
```

Note that doubles are used here, not floats, and that the integer form is just a plain old cast, not an interpretation of the bit pattern.

```C
i  = 0x5f3759df - ( i >> 1 );               // what the fuck?
```

That single line computes an inverse square root approximation on a floating point number by realising that the raw bit pattern is an approximate log, and then exploiting identities and algebra, as well as extremely fast operations like shifting and subtraction.

I've often heard this algorithm referred to as a "hack". Now, I'm not one to put down a hacky solution, but a hack this is not. This is absolutely a solid, well thought-out piece of engineering, employed to compute an expensive operation thousands of times per second on the under-powered hardware of the day.

I'll make a quick note here that this algorithm will *only* work with "normal" floating point numbers. A "sub-normal" (that is, a tiny number *very* close to zero) will fall apart, because the log approximation assumes $\log_2(1 + x) \approx x + \sigma$, but what we'd actually be plugging in would be $0 + x$.

## Newton's method

The approximation described above is pretty good, but definitely contains measurable error. That's where the next line comes in.

```C
y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
```

This line improves the approximation by a significant margin, by utilising an algorithm called Newton's method, or the [Newton-Raphson method](https://en.wikipedia.org/wiki/Newton%27s_method). This is a generic, iterative mathematical technique for finding the roots (zeros) of a function. You might wonder how that could be helpful here, since we aren't looking for a zero. Well, we already have our approximation $y$, and we can create a new expression:

$$
f(y) = \frac{1}{y^2} - x = 0
$$

If $y$ really is $\frac{1}{\sqrt{x}}$, then squaring it gives us $\frac{1}{x}$, and inverting that gives us $x$. The whole expression becomes $x - x$, which is of course just 0. In other words, the exact answer we're after is a root of $f$, which is a form that we can use for Newton's method.

Newton's method works like this: Given an initial approximation $y_n$, we can create a better approximation $y_{n+1}$ like this:

$$
y_{n+1} = y_n - \frac{f(y_n)}{f'(y_n)}
$$

Where $f'(y)$ is the [*derivative*](https://en.wikipedia.org/wiki/Derivative) of $f(y)$. When we take the derivative of a function for a given input, we're determining the slope or gradient of the function for that input. In other words, it's the rate of change. The improved approximation works by taking our current approximation (which we know is not yet correct), and nudging it along the slope towards the correct answer. It's kind of mind-boggling that this works, but there you go! I should note that this particular algorithm does not work for all circumstances, but it is a very powerful tool to throw at these kinds of problems!

So what is the derivative of our $f(y)$ function? First let's rearrange the function a little:

$$
f(y) = y^{-2} - x
$$

Taking the derivative for a function in this form works like:

$$
\frac{d}{dx}(x^n + c) = nx^{n-1}
$$

So we get:

$$
\frac{d}{dy} f(y) = -2y^{-3}
$$

$$
\frac{d}{dy} f(y) = -\frac{2}{y^3}
$$

That gives us this expression for the better approximation:

$$
y_{n+1} = y_n - \frac{\frac{1}{y_n^2} - x}{-\frac{2}{y_n^3}}
$$

There is a problem with this form, however. A very large part of why this algorithm is fast is because it avoids floating point divisions, and the above equation has 3 of them! Fortunately, our very good friend algebra has our back again, and we can rearrange the expression into this form:

$$
y_{n+1} = y_n(1.5 - 0.5x \cdot y_n^2)
$$

No divisions, only multiplications!
The exact steps to go from the first form to this one are *numerous* to say the least, but I've included them in full for completeness. Feel free to skip over them and pick back up on the code below.

$$
y_{n+1} = y_n - \frac{\frac{1 - xy_n^2}{y_n^2}}{-\frac{2}{y_n^3}}
$$

$$
y_{n+1} = y_n - \frac{1 - xy_n^2}{y_n^2} \cdot (-\frac{y_n^3}{2})
$$

$$
y_{n+1} = y_n - \frac{-(1 - xy_n^2)}{y_n^2} \cdot \frac{y_n^3}{2}
$$

$$
y_{n+1} = y_n - \frac{-(1 - xy_n^2)}{\cancel{y_n^2}} \cdot \frac{\cancel{y_n^2}y_n}{2}
$$

$$
y_{n+1} = y_n - (-(1-xy_n^2) \cdot \frac{y_n}{2})
$$

$$
y_{n+1} = y_n - (-1\cdot(1-xy_n^2) \cdot \frac{y_n}{2})
$$

$$
y_{n+1} = y_n - ((-1\cdot1 + -1\cdot(-xy_n^2)) \cdot \frac{y_n}{2})
$$

$$
y_{n+1} = y_n - (-1\frac{y_n}{2} + xy_n^2\frac{y_n}{2})
$$

$$
y_{n+1} = y_n - (-\frac{y_n}{2} + \frac{xy_n^2y_n}{2})
$$

$$
y_{n+1} = y_n - (\frac{-y_n+xy_n^3}{2})
$$

$$
y_{n+1} = y_n - (\frac{y_n \cdot -1 +xy_n^3}{2})
$$

$$
y_{n+1} = y_n - (\frac{y_n \cdot -1 + y_n(xy_n^2)}{2})
$$

$$
y_{n+1} = y_n - (\frac{y_n(-1 + xy_n^2)}{2})
$$

$$
y_{n+1} = y_n \cdot \frac{2}{2} - \frac{y_n(-1 + xy_n^2)}{2}
$$

$$
y_{n+1} = \frac{2y_n}{2} - \frac{y_n(-1 + xy_n^2)}{2}
$$

$$
y_{n+1} = \frac{2y_n - y_n(-1 + xy_n^2)}{2}
$$

$$
y_{n+1} = \frac{2y_n + y_n(-1(-1 + xy_n^2))}{2}
$$

$$
y_{n+1} = \frac{y_n(2 -1(-1 + xy_n^2))}{2}
$$

$$
y_{n+1} = \frac{y_n(2 + 1 -xy_n^2)}{2}
$$

$$
y_{n+1} = \frac{y_n(3 -xy_n^2)}{2}
$$

$$
y_{n+1} = y_n(\frac{3}{2}
- \frac{xy_n^2}{2}) 518 | $$ 519 | 520 | $$ 521 | y_{n+1} = y_n(\frac{3}{2} - \frac{x}{2} y_n^2) 522 | $$ 523 | 524 | $$ 525 | y_{n+1} = y_n \cdot (1.5 - (0.5x \cdot y_n \cdot y_n)) 526 | $$ 527 | 528 | So that is the last line of the function before the return: 529 | 530 | ```C 531 | y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration 532 | ``` 533 | 534 | Amazingly, that ends up yielding a maximum absolute error of 0.175% (and often has a far lower error). Normally, Newtons method is applied iteratively to obtain closer and closer approximations, but in the case of the Quake code, only a single iteration was used. In the original source, a second iteration is present, but is commented out. 535 | 536 | ```C 537 | // y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed 538 | ``` 539 | 540 | ## Conclusion 541 | 542 | This algorithm is outright astonishing. It builds on a deep knowledge of the internal mathematical details of the floating point number system, understanding what runs fast and slow on a computer, some nimble algebraic gymnastics, and a centuries old root finding method discovered by none other than Issac Newton himself, and solves a problem that was a computational bottleneck at a particular period of history. 543 | 544 | I mentioned that Carmack did not actually come up with this himself (though I wouldn't put it past him!). [The truth is the exact origin is not 100% certain](https://www.beyond3d.com/content/articles/8/). There's something kind of incredible about that, too. 545 | 546 | And believe it or not, this rabbit hole actually goes even deeper. Mathematician [Chris Lomont wrote up a paper](http://www.lomont.org/papers/2003/InvSqrt.pdf) trying to find the optimal value for sigma in the log approximation step. It's definitely worth a look if this hasn't fully satisfied your curiosity about the subject. 
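
If you'd like to poke at the error figures yourself, here is a minimal sketch of how they could be measured. It is not the Quake code: the function names `rsqrt_approx` and `max_relative_error`, the `newton_steps` parameter, and the sweep range are all my own assumptions for illustration, and `memcpy` stands in for the original pointer cast to avoid undefined behaviour. The magic constant is the well-known one from the Quake 3 source.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: the fast inverse square root with a configurable
   number of Newton steps (the original hardcodes exactly one). */
static float rsqrt_approx(float x, int newton_steps) {
    assert(x > 0.0f);                 /* only meaningful for positive inputs */
    const float threehalfs = 1.5f;
    const float x2 = x * 0.5f;
    uint32_t i;
    float y;

    memcpy(&i, &x, sizeof i);         /* reinterpret the float's bits as an integer */
    i = 0x5f3759dfu - (i >> 1);       /* initial guess from the log approximation */
    memcpy(&y, &i, sizeof y);         /* and back to a float */

    for (int n = 0; n < newton_steps; n++) {
        y = y * (threehalfs - (x2 * y * y));  /* y_{n+1} = y_n(1.5 - 0.5x * y_n^2) */
    }
    return y;
}

/* Hypothetical helper: largest relative error seen over a sweep of inputs.
   Each input is built as x = t * t, so the exact answer is simply 1 / t
   (up to float rounding), which avoids needing a library sqrt. */
static double max_relative_error(int newton_steps) {
    double worst = 0.0;
    for (float t = 0.1f; t < 10.0f; t *= 1.001f) {
        float x = t * t;
        double exact = 1.0 / (double)t;
        double err = ((double)rsqrt_approx(x, newton_steps) - exact) / exact;
        if (err < 0.0) err = -err;
        if (err > worst) worst = err;
    }
    return worst;
}
```

Calling `max_relative_error(0)` shows how rough the raw bit-level guess is, `max_relative_error(1)` should land close to the 0.175% figure quoted above, and `max_relative_error(2)` gives a feel for what the commented-out second iteration would have bought.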

Lastly, [I recently wrote about CORDIC](https://github.com/francisrstokes/githublog/blob/main/2024/5/10/cordic.md), an algorithm for computing sines and cosines without floats, using only addition and bit shifting. Some folks had asked in the comments about its similarity to the fast inverse square root algorithm. I replied that it wasn't that similar, *really* - fast inverse square root being all about floating point, bit-level interpretations, and root-finding.

But then I stopped to actually think about it, and while there are large differences in the details of the algorithms, there is a lot of *spirit* in common. Specifically, making clever mathematical observations, and bringing those to bear efficiently on the hardware constraints of the time.

Some people look at algorithms like CORDIC and fast inverse square root, and think them only relics of the past; a technology with no utility in the modern world. I don't think I have to tell you that I disagree with that premise.

A lot of us get into this field because, as kids, we loved to crack things open and see how they worked (even if, sometimes, we couldn't put them back together afterwards). Algorithms such as these live in that same space for me. I've tried to keep that curious spark alive, and turn it on problems and technology that aren't immediately relevant to my everyday work. And the really crazy thing is that often the underlying elements *do* help me solve real problems! Knowledge is synthesisable, who would have thought.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# The World's Simplest Blog

Hi, I'm Francis Stokes 👋 I'm sick of complex blogging solutions, so markdown files in a git repo it is.

## RSS Feed

[An RSS feed for this blog is available.](https://raw.githubusercontent.com/francisrstokes/githublog/main/feed.xml)

The [feed generator is bespoke](./feed-builder/index.ts), and while you might argue that such a feature is adding complexity to something that's meant to be simple, my response would be: *whatever* 🤷‍♂️

## Articles

- [I got an IP address](./2024/11/26/getting-an-ip-address.md) [26/11/2024]
- [I sent an ethernet packet](./2024/11/1/sending-an-ethernet-packet.md) [1/11/2024]
- [Everything I Know About The Fast Inverse Square Root Algorithm](./2024/5/29/fast-inverse-sqrt.md) [29/5/2024]
- [Why the CORDIC algorithm lives rent-free in my head](./2024/5/10/cordic.md) [10/5/2024]
- [Building A Jank UART to USB Cable From Scavenged Parts](./2023/3/1/building-a-jank-uart-cable-from-scavenged-parts.md) [1/3/2023]
- [Rolling your own crypto: Everything you need to build AES from scratch (and then never use it for anything of consequence)](./2022/6/15/rolling-your-own-crypto-aes.md) [15/06/2022]
- [Notes on a lateral career move](./2022/4/29/notes-on-a-lateral-career-move.md) [29/04/2022]
- [Doing more than one thing at once](./2021/12/14/doing-more-than-one-thing.md) [14/12/2021]

--------------------------------------------------------------------------------
/assets/AES-Block.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/AES-Block.png
--------------------------------------------------------------------------------
/assets/AES-CBC.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/AES-CBC.png
--------------------------------------------------------------------------------
/assets/AES-Key_Schedule_128-bit_key.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/AES-Key_Schedule_128-bit_key.png -------------------------------------------------------------------------------- /assets/AES-Padding-Blocks.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/AES-Padding-Blocks.png -------------------------------------------------------------------------------- /assets/AES-RotWord.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/AES-RotWord.png -------------------------------------------------------------------------------- /assets/AES-s-box.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/AES-s-box.png -------------------------------------------------------------------------------- /assets/AES-shift-rows.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/AES-shift-rows.png -------------------------------------------------------------------------------- /assets/ArduinoRedGreenLEDs.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/ArduinoRedGreenLEDs.jpg -------------------------------------------------------------------------------- /assets/PinoutArduinoUno.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/PinoutArduinoUno.png -------------------------------------------------------------------------------- /assets/StackFrames-Improved.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/StackFrames-Improved.png -------------------------------------------------------------------------------- /assets/StackFrames-Save.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/StackFrames-Save.png -------------------------------------------------------------------------------- /assets/StackFrames-Wasteful.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/StackFrames-Wasteful.png -------------------------------------------------------------------------------- /assets/Tux.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/Tux.png -------------------------------------------------------------------------------- /assets/Uno.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/Uno.png -------------------------------------------------------------------------------- /assets/cordic/0.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/cordic/0.png -------------------------------------------------------------------------------- /assets/cordic/0r.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/cordic/0r.png -------------------------------------------------------------------------------- /assets/cordic/1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/cordic/1.png -------------------------------------------------------------------------------- /assets/cordic/16.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/cordic/16.png -------------------------------------------------------------------------------- /assets/cordic/1r.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/cordic/1r.png -------------------------------------------------------------------------------- /assets/cordic/2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/cordic/2.png -------------------------------------------------------------------------------- /assets/cordic/2r.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/cordic/2r.png 
-------------------------------------------------------------------------------- /assets/cordic/3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/cordic/3.png -------------------------------------------------------------------------------- /assets/cordic/3r.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/cordic/3r.png -------------------------------------------------------------------------------- /assets/cordic/4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/cordic/4.png -------------------------------------------------------------------------------- /assets/cordic/binary-search.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/cordic/binary-search.png -------------------------------------------------------------------------------- /assets/cordic/cordic.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/cordic/cordic.gif -------------------------------------------------------------------------------- /assets/cordic/fixed-point.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/cordic/fixed-point.png -------------------------------------------------------------------------------- 
/assets/cordic/fixed-whole-fractional.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/cordic/fixed-whole-fractional.png -------------------------------------------------------------------------------- /assets/fast-inverse-sqrt/floats-as-ints.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/fast-inverse-sqrt/floats-as-ints.png -------------------------------------------------------------------------------- /assets/fast-inverse-sqrt/infinity.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/fast-inverse-sqrt/infinity.png -------------------------------------------------------------------------------- /assets/fast-inverse-sqrt/log2-vs-ints.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/fast-inverse-sqrt/log2-vs-ints.png -------------------------------------------------------------------------------- /assets/fast-inverse-sqrt/log2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/fast-inverse-sqrt/log2.png -------------------------------------------------------------------------------- /assets/fast-inverse-sqrt/nan.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/fast-inverse-sqrt/nan.png 
-------------------------------------------------------------------------------- /assets/fast-inverse-sqrt/normal.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/fast-inverse-sqrt/normal.png -------------------------------------------------------------------------------- /assets/fast-inverse-sqrt/subnormal.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/fast-inverse-sqrt/subnormal.png -------------------------------------------------------------------------------- /assets/high-level-state-diagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/high-level-state-diagram.png -------------------------------------------------------------------------------- /assets/serial/arduino-duemilanove.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/serial/arduino-duemilanove.jpg -------------------------------------------------------------------------------- /assets/serial/cable-complete.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/serial/cable-complete.jpg -------------------------------------------------------------------------------- /assets/serial/cable-cut.jpg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/serial/cable-cut.jpg -------------------------------------------------------------------------------- /assets/serial/cable-tinned.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/serial/cable-tinned.jpg -------------------------------------------------------------------------------- /assets/serial/chip-removed.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/serial/chip-removed.jpg -------------------------------------------------------------------------------- /assets/serial/chip-soldered.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/serial/chip-soldered.jpg -------------------------------------------------------------------------------- /assets/serial/ft232rl-pinout.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/serial/ft232rl-pinout.png -------------------------------------------------------------------------------- /assets/serial/ft232rl-terminal.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/serial/ft232rl-terminal.png -------------------------------------------------------------------------------- /assets/serial/smd-breakouts.jpg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/serial/smd-breakouts.jpg -------------------------------------------------------------------------------- /assets/serial/usb-charger.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/serial/usb-charger.jpg -------------------------------------------------------------------------------- /assets/serial/usb-soldered.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/serial/usb-soldered.jpg -------------------------------------------------------------------------------- /assets/serial/with-arduino.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/serial/with-arduino.jpg -------------------------------------------------------------------------------- /assets/serial/with-headers.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/serial/with-headers.jpg -------------------------------------------------------------------------------- /assets/tux.cbc.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/tux.cbc.png -------------------------------------------------------------------------------- /assets/tux.enc.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/tux.enc.png -------------------------------------------------------------------------------- /assets/tux.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/tux.gif -------------------------------------------------------------------------------- /assets/w5100-project/architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/w5100-project/architecture.png -------------------------------------------------------------------------------- /assets/w5100-project/bodge-wires.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/w5100-project/bodge-wires.jpg -------------------------------------------------------------------------------- /assets/w5100-project/connected-interfaces-dhcp.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/w5100-project/connected-interfaces-dhcp.png -------------------------------------------------------------------------------- /assets/w5100-project/dhcp-state-machine.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/w5100-project/dhcp-state-machine.png -------------------------------------------------------------------------------- /assets/w5100-project/first-actual-packet.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/w5100-project/first-actual-packet.png -------------------------------------------------------------------------------- /assets/w5100-project/first-unsuccessful-packet.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/w5100-project/first-unsuccessful-packet.png -------------------------------------------------------------------------------- /assets/w5100-project/shield.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/w5100-project/shield.jpg -------------------------------------------------------------------------------- /assets/w5100-project/with-saleae.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/francisrstokes/githublog/5d1926c420d8c5167c69943a3e4ce7d75b4fec6f/assets/w5100-project/with-saleae.jpg -------------------------------------------------------------------------------- /feed-builder/index.ts: -------------------------------------------------------------------------------- 1 | import * as fs from 'fs/promises'; 2 | import * as path from 'path'; 3 | 4 | // Libs for processing markdown content 5 | import {remark} from 'remark'; 6 | import strip from 'strip-markdown'; 7 | 8 | // Sigh... 
9 | import { fileURLToPath } from 'url'; 10 | const __filename = fileURLToPath(import.meta.url); 11 | const __dirname = path.dirname(__filename); 12 | 13 | const yearRegex = /^2[0-9]{3}$/; // Only expect blogs written in this millennium 14 | const repoDir = path.join(__dirname, '..'); 15 | const feedPath = path.join(repoDir, 'feed.xml'); 16 | 17 | const githubURLBase = 'https://github.com/francisrstokes/githublog'; 18 | const githubURLPrefix = `${githubURLBase}/blob/main`; 19 | 20 | const blogTitle = 'Francis Stokes :: Githublog'; 21 | const blogDescription = "I'm sick of complex blogging solutions, so markdown files in a git repo it is."; 22 | const copyrightField = `${new Date().getFullYear()} Francis Stokes - All rights reserved`; 23 | const lastBuildDate = new Date().toUTCString(); 24 | const ttl = 86400 / 60; // 1 day, in minutes 25 | 26 | const descriptionCharsToUse = 420; 27 | 28 | // Pff, use a real XML library? What does this look like? An actual blog framework?! 29 | const rssTemplate = ` 30 | 31 | 32 | ${blogTitle} 33 | ${blogDescription} 34 | ${githubURLBase} 35 | ${copyrightField} 36 | ${lastBuildDate} 37 | ${lastBuildDate} 38 | ${ttl} 39 | 40 | {blog_entries} 41 | 42 | 43 | 44 | `; 45 | 46 | const stripMarkdown = (text: string) => remark().use(strip).process(text).then(String); 47 | 48 | type PostInfo = { 49 | title: string; 50 | description: string; 51 | }; 52 | const extractPostInfo = async (markdownContents: string): Promise<PostInfo> => { 53 | const contentWithoutEmptyLines = markdownContents.split('\n').filter(Boolean); 54 | 55 | const title = await stripMarkdown(contentWithoutEmptyLines[0]); 56 | 57 | const nonTitleContent = contentWithoutEmptyLines.slice(1) 58 | .join('\n') 59 | .slice(0, descriptionCharsToUse); 60 | 61 | const description = (await stripMarkdown(nonTitleContent)).trim().replace(/\n/g, ' ') + '...'; 62 | 63 | return { title, description }; 64 | }; 65 | 66 | const generateRSSItem = (postInfo: PostInfo, link: string, date: string) => ` 67 | 
${postInfo.title} 68 | ${postInfo.description} 69 | ${link} 70 | ${date} 71 | 72 | `; 73 | 74 | const findBlogsInYearDir = async (yearDir: string) => { 75 | const monthDirs = await fs.readdir(yearDir); 76 | 77 | // Read all of the day directories concurrently into a flat list 78 | const dayDirs = await Promise.all(monthDirs.map(month => { 79 | const fullMonthDir = path.join(yearDir, month); 80 | return fs.readdir(fullMonthDir) 81 | .then(days => days.map(day => path.join(fullMonthDir, day))); 82 | })) 83 | .then(monthIndexedDays => monthIndexedDays.flat()); 84 | 85 | // Read all of the blog entries concurrently into a flat list 86 | return Promise.all(dayDirs.map(dayDir => { 87 | return fs.readdir(dayDir) 88 | .then(blogEntries => blogEntries.map(blog => path.join(dayDir, blog))); 89 | })) 90 | .then(dayIndexedBlogs => dayIndexedBlogs.flat()); 91 | } 92 | 93 | const main = async () => { 94 | const results = await fs.readdir(repoDir); 95 | 96 | const blogDirs = results 97 | .filter(dir => yearRegex.test(dir)) 98 | .map(yearDir => path.join(repoDir, yearDir)); 99 | 100 | const allBlogs = await Promise.all(blogDirs.map(findBlogsInYearDir)) 101 | .then(allBlogsInYear => allBlogsInYear.flat()); 102 | 103 | // Order by date. Probably a better way of doing this, but you know what they say: 104 | // When you've got regular expressions, everything looks like a parsing problem! 
105 | const dateExtractionRegex = /.+?(2\d{3}\/\d{1,2}\/\d{1,2}).+/; 106 | allBlogs.sort((a, b) => { 107 | const aMatchResult = a.match(dateExtractionRegex); 108 | const bMatchResult = b.match(dateExtractionRegex); 109 | 110 | if (!aMatchResult || !bMatchResult) return 0; 111 | if (aMatchResult.length < 1 || bMatchResult.length < 1) return 0; 112 | 113 | return +(new Date(aMatchResult[1])) - +(new Date(bMatchResult[1])); 114 | }); 115 | 116 | // Get post info for all blogs 117 | const postInfo = await Promise.all(allBlogs.map(blog => fs.readFile(blog, 'utf-8').then(extractPostInfo))); 118 | 119 | // Get all blog publication dates 120 | const dates = allBlogs.map(blog => { 121 | const extractedDate = blog.replace(repoDir + '/', '').split('/').slice(0, 3).join('/'); 122 | return new Date(extractedDate).toUTCString(); 123 | }); 124 | 125 | // Get URLs for all blogs 126 | const urls = allBlogs.map(blog => blog.replace(repoDir, githubURLPrefix)); 127 | 128 | // Generate rss entries for all blogs 129 | const rssItems: string[] = []; 130 | for (let i = 0; i < postInfo.length; i++) { 131 | rssItems.push(generateRSSItem(postInfo[i], urls[i], dates[i])); 132 | } 133 | 134 | const rssBlogEntries = rssItems.join('\n'); 135 | const rssFeed = rssTemplate.replace('{blog_entries}', rssBlogEntries); 136 | 137 | await fs.writeFile(feedPath, rssFeed, 'utf-8'); 138 | } 139 | 140 | main(); -------------------------------------------------------------------------------- /feed-builder/package.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "feed-builder", 3 | "type": "module", 4 | "version": "1.0.0", 5 | "description": "An RSS feed builder for the githublog", 6 | "main": "index.ts", 7 | "scripts": { 8 | "test": "echo \"Error: no test specified\" && exit 1" 9 | }, 10 | "repository": { 11 | "type": "git", 12 | "url": "git+https://github.com/francisrstokes/githuBlog.git" 13 | }, 14 | "author": "Francis Stokes", 15 | "license": "MIT", 
  "bugs": {
    "url": "https://github.com/francisrstokes/githuBlog/issues"
  },
  "homepage": "https://github.com/francisrstokes/githuBlog#readme",
  "devDependencies": {
    "@types/node": "^18.16.3"
  },
  "dependencies": {
    "remark": "^14.0.2",
    "strip-markdown": "^5.0.0"
  }
}

--------------------------------------------------------------------------------
/feed-builder/tsconfig.json:
--------------------------------------------------------------------------------
{
  "compilerOptions": {
    "target": "ESNext",
    "module": "ESNext",
    "moduleResolution": "node",
    "strict": true,
    "noImplicitAny": true,
  },
  "include": [
    "*"
  ]
}
--------------------------------------------------------------------------------
/feed.xml:
--------------------------------------------------------------------------------
Francis Stokes :: Githublog
I'm sick of complex blogging solutions, so markdown files in a git repo it is.
https://github.com/francisrstokes/githublog
2024 Francis Stokes - All rights reserved
Sun, 01 Dec 2024 21:09:46 GMT
Sun, 01 Dec 2024 21:09:46 GMT
1440

Doing more than one thing at a time
How do computers run multiple independent programs at once? Maybe words like "multi-core" or "hardware threads" pop into your mind. These are some of the solutions of the modern age, which essentially just throw more hardware at the problem in order to achieve p...
https://github.com/francisrstokes/githublog/blob/main/2021/12/14/doing-more-than-one-thing.md
Tue, 14 Dec 2021 00:00:00 GMT

Notes on a lateral career move
I changed jobs recently. That's not all that uncommon for devs - in fact it's accepted wisdom that your best course of action is to switch every 2 years or so in order to maximise your salary. That may or may not be true, but it's not really the route I've taken. What was perhaps a little more unusual about my change was that I made a somewhat lateral move between fields of software engineering, from mostly web-based...
https://github.com/francisrstokes/githublog/blob/main/2022/4/29/notes-on-a-lateral-career-move.md
Fri, 29 Apr 2022 00:00:00 GMT

Rolling your own crypto: Everything you need to build AES from scratch (and then never use it for anything of consequence)
You often hear the phrase "Don't roll your own crypto". I think this sentence is missing an important qualifier: "...and then use it for anything of consequence". If you are building a product or service, or are trying to communicate privately, then you should absolutely pick a vetted, open source, off-the-shelf implementation and use it. If, however, your goal is to learn, then there is honestly no better wa...
https://github.com/francisrstokes/githublog/blob/main/2022/6/15/rolling-your-own-crypto-aes.md
Wed, 15 Jun 2022 00:00:00 GMT

Building A Jank UART to USB Cable From Scavenged Parts
My home "lab" is, unfortunately, a manifestation of the unwinnable, uphill battle against entropy. The latest victim to the sprawl of boards, prototypes, and other miscellanea was my little Adafruit CP2104 USB to serial converter. As far as I can tell, it has literally dropped off the face of the earth. This is particularly irritating as I'm in the middle of another project for an upcoming video on the \[channel]\(http...
https://github.com/francisrstokes/githublog/blob/main/2023/3/1/building-a-jank-uart-cable-from-scavenged-parts.md
Wed, 01 Mar 2023 00:00:00 GMT

Why the CORDIC algorithm lives rent-free in my head
This post is an adaptation of a twitter thread I put together a few days ago. CORDIC is an algorithm for computing trig functions like sin, cos, tan etc on low powered hardware, without an FPU (i.e. no floating point) or expensive lookup tables. In fact, it reduces these complex functions to simple addit...
https://github.com/francisrstokes/githublog/blob/main/2024/5/10/cordic.md
Fri, 10 May 2024 00:00:00 GMT

Everything I Know About The Fast Inverse Square Root Algorithm
The fast inverse square root algorithm, made famous (though not invented) by programming legend John Carmack in the Quake 3 source code, computes an inverse square root $\frac{1}{\sqrt{x}}$ with a bewildering handful of lines that interpret and manipulate the raw bits of float. It's wild....
https://github.com/francisrstokes/githublog/blob/main/2024/5/29/fast-inverse-sqrt.md
Wed, 29 May 2024 00:00:00 GMT

I sent an ethernet packet
For as long as I've been making videos on the low byte productions youtube channel, I've wanted to make a series about "Networking from scratch", by which I mean building a full TCP/IP stack from the ground up on a microcontroller. It's been nearly 6 years now, and the past few days felt like as good a time as any to start. This blog entry is fairly limited in scope; On the...
https://github.com/francisrstokes/githublog/blob/main/2024/11/1/sending-an-ethernet-packet.md
Fri, 01 Nov 2024 00:00:00 GMT

I got an IP address
This is a follow up to a previous blog entry, I sent an ethernet packet. I'm writing a networking stack on a microcontroller. Not for production, or to make the fastest/smallest footprint/insert metric here, but just to get a deeper understanding about how things work all the way at the bottom, and hopefully to be able...
https://github.com/francisrstokes/githublog/blob/main/2024/11/26/getting-an-ip-address.md
Tue, 26 Nov 2024 00:00:00 GMT

--------------------------------------------------------------------------------
/generate-feed.sh:
--------------------------------------------------------------------------------
#!/bin/sh

ts-node --esm feed-builder
--------------------------------------------------------------------------------