<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="http://addumb.com/feed.xml" rel="self" type="application/atom+xml" /><link href="http://addumb.com/" rel="alternate" type="text/html" /><updated>2026-01-09T08:04:53+00:00</updated><id>http://addumb.com/feed.xml</id><title type="html">Addumb’s blog</title><subtitle>Blog mixing tech, linux, engineering, midwest transplantation, and other random junk.</subtitle><author><name>Adam Gray</name></author><entry><title type="html">My Local-only shitty AI opportunistic doorbell camera explainer explainer</title><link href="http://addumb.com/2025/12/06/local-only-shitty-ai-opportunistic-doorbell-camera-description/" rel="alternate" type="text/html" title="My Local-only shitty AI opportunistic doorbell camera explainer explainer" /><published>2025-12-06T15:00:00+00:00</published><updated>2025-12-06T15:00:00+00:00</updated><id>http://addumb.com/2025/12/06/local-only-shitty-ai-opportunistic-doorbell-camera-description</id><content type="html" xml:base="http://addumb.com/2025/12/06/local-only-shitty-ai-opportunistic-doorbell-camera-description/"><![CDATA[<p>tl; dr: For today at least, I use Home Assistant’s HACS integrations for vivint (my home security system), ollama-vision (I don’t know wtf this means), and my <em>Windows</em> gaming PC which has an RTX 5080 (16GB VRAM) to give a slow, inaccurate, and unreliable description of who or what is at my front door based on AI magicks’ing my doorbell camera. It takes around 3 seconds usually, sometimes far less (600ms) and sometimes far more (6 fucking minutes). But it is 100% completely self-hosted and bereft of subscription fees, though thus wildly unreliable.</p>

<p>Send the classic “somebody is at your front door” notification, but with genAI slop trying to roast you when you get home:</p>

<p><img src="/assets/home-assistant-doorbell-roast.jpg" alt="bruh, dude's just chillin' at the door, no mask, no gear, just a package, like what's the drama?" width="800" /></p>

<p>While <strong>100% self-hosted on your own network</strong> on mostly open source stuff. On Linux the closed-source NVIDIA drivers are required for ollama to be useful.</p>

<p>I don’t know what MCP is, I don’t know what “local inference” means and I don’t give a shit and that’s my problem professionally, but not personally. I want shitty automated jokes about people at my door delivered to my phone within a few minutes maybe. I don’t know in any detail what <code class="language-plaintext highlighter-rouge">llama3.2-vision</code> means and how it compares to anything else listed on ollama models lists. I’m not a luddite, I’m just a human bean and not super excited about techno-fascists sending obviated labor to prisons via a few trivial intermediate steps.</p>

<ol>
  <li>Run Home Assistant on a NUC in your closet or garage, your “MPoE.” Don’t think about it.</li>
  <li>Have fun gaming on a PC with a “decent” (read: great) GPU.</li>
  <li>Use your PC as a space heater to give some laughs.</li>
</ol>

<h2 id="get-to-the-point">Get to the point:</h2>
<p>I’d been considering buying an additional mini PC to sit alongside my ~$1K ASUS NUC 14 Pro with 128G-fuckin-B of RAM: a ~$2K AMD Ryzen AI Max+ 395 system like the Framework Desktop, whose unified memory gives the integrated GPU something like 72G-fuckin-B of <strong>VRAM</strong>. But I already have a gaming PC which also cost me $2K, so why not try the paltry 16GB VRAM of an RTX 5080? It works. Poorly and inaccurately, but it totally works well enough to give me a laugh instead of another headache.</p>

<p>The pieces, I think:</p>
<ol>
  <li>Run Home Assistant in some rando place; nobody cares where, just do it, it’s fun to make your lights turn off at bedtime.</li>
  <li>Enjoy gaming on a PC and forego the budget anguish of 2025 PC gaming with 60%+ of the build cost being solely the GPU.</li>
  <li>Have a doorbell camera or some other fuckin’ camera, I don’t give a shit. This is obviously not doorbell specific.</li>
  <li>Resign to the fact that you are not gaming on your PC as much as you thought you were.</li>
  <li>Use that now-slack capacity of your GPU just in case your cameras need some shitty AI text to accompany them.</li>
  <li>Be a bit more patient, you’re using your shit-ass home server and your shit-ass gaming PC’s maybe-not-otherwise-utilized GPU.</li>
</ol>

<h2 id="run-home-assistant-on-some-rando-place">Run Home Assistant on some rando place</h2>
<p>Nobody cares where, just do it. It’s fun to make your lights turn off at bedtime. I’m not Google, figure this out on your own.</p>

<h2 id="enjoy-gaming-on-a-pc">Enjoy gaming on a PC</h2>
<p>This is increasingly difficult with hardware spread tripping myriad bugs, component costs blowing through every ceiling, and your day job also being glued to the same chair, keyboard, monitor, and mouse. It’s a fuckin’ drag.</p>

<h2 id="have-a-camera-to-en-ai-the-things">Have a camera to en-AI the things</h2>
<p>I don’t give a shit, if you’re here you already have one because you’re me and nobody reads this.</p>

<h2 id="admit-that-your-gaming-pcs-gpu-is-largely-unutilized">Admit that your gaming PC’s GPU is largely unutilized</h2>
<p>Get over it, you know I’m right. If I’m not then congrats on your retirement.</p>

<h2 id="self-hosted-gpu-for-home-assistant-ollama-vision-summary-of-your-doorbell-camera">Self-Hosted GPU for Home Assistant ollama-vision summary of your doorbell camera</h2>
<p>i.e. the point of this post, why did I write other shit? What an idiot.</p>

<p><strong><em>First:</em></strong> <a href="https://ollama.com/">Download ollama</a> onto your gaming PC. Play around with it. It’s fun. Try to make it swear, try to make it say “boner” and all that. Have your fun.</p>

<p>Try out a “vision model.” I don’t know what that means other than I can supply an image URL from wikipedia or wherever and ask it to describe the image and it does. I can ask it to make fun of the image and it mostly does. I tried “llama3.2-vision” because 1) it says “vision”, and 2) “ollama” shares a lot of in-sequence letters with “llama” so I suppose they’re kind of thematically related (yes, I get that my employer, Meta, made LLAMA but that’s honestly the full extent of the similarity as I understand it).</p>
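<p>For the curious: under the hood, ollama exposes a plain HTTP API, and vision models take images as base64 strings. Here’s a sketch (my own Python, <em>not</em> the integration’s code) of the request body ollama’s <code class="language-plaintext highlighter-rouge">/api/generate</code> endpoint accepts; the “image” here is fake bytes standing in for a real JPEG:</p>

```python
import base64
import json

# Stand-in for real JPEG bytes you'd read off disk.
fake_jpeg = b"\xff\xd8\xff\xe0 definitely a hotdog, trust me \xff\xd9"

# The JSON body ollama's /api/generate endpoint accepts for a vision model.
payload = {
    "model": "llama3.2-vision",
    "prompt": "Is this a hotdog? Roast it either way.",
    # ollama wants images as a list of base64-encoded strings
    "images": [base64.b64encode(fake_jpeg).decode("ascii")],
    "stream": False,
}

body = json.dumps(payload)
```

<p>POST that to <code class="language-plaintext highlighter-rouge">http://YOUR-GAMING-PC:11434/api/generate</code> (hostname obviously yours, not mine) and you get a description back. The integrations below do exactly this so you never have to.</p>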

<p><strong><em>Second:</em></strong> Try out the <a href="https://www.home-assistant.io/integrations/ollama/">Home Assistant integration for ollama</a> pointing back to your gaming PC. Serve ollama bound to all interfaces (so other machines on your LAN can actually reach it) with one of these commands in a terminal.
PowerShell:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$Env:OLLAMA_HOST = "0.0.0.0:11434"; ollama serve
</code></pre></div></div>
<p>bash:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>OLLAMA_HOST=0.0.0.0:11434 ollama serve
</code></pre></div></div>

<p>Then configure the integration to point to your gaming PC’s IP. Talk to it, use it as a chatbot. It works, it’s not amazing, but it’s yours. Use it as a “conversation agent” for HA’s “Assist” instead of their own or OpenAI. Ollama is “free” as in “it’s winter and my energy bill goes to heat anyway, so sure power up that RTX 5080 space heater.”</p>
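<p>If the integration can’t connect, it’s almost always the bind address or Windows Firewall. A quick sanity check you can run from any box on your LAN (a hypothetical helper of mine, nothing to do with ollama itself; it just pokes the port):</p>

```python
import socket

def ollama_reachable(host, port=11434, timeout=2.0):
    """Return True if something is listening on host:port (hopefully ollama)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

<p>If this returns False while ollama is running on the PC, you forgot <code class="language-plaintext highlighter-rouge">OLLAMA_HOST=0.0.0.0</code> or the firewall is eating port 11434.</p>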

<p><strong><em>Third:</em></strong> Throw away the ollama integration. It was fun, but that’s not why you’re here. You’re here to automate, not to chat. Use the <a href="https://github.com/remimikalsen/ollama_vision/">ollama_vision</a> integration in HACS. And <em>VERY CAREFULLY</em> read the section on “Events”: <a href="https://github.com/remimikalsen/ollama_vision/?tab=readme-ov-file#events">https://github.com/remimikalsen/ollama_vision/?tab=readme-ov-file#events</a>. Point it at your gaming PC. If you don’t know what model to use, neither does anybody else except for the mega fans. NOBODY ELSE CARES, folks. I picked llama3.2-vision and it works. I don’t know why, but it works. I suppose it’s because it says “llama” and “vision?” I don’t care.</p>

<p><strong><em>Fourth:</em></strong> Create an automation to <em>manually</em> send a <a href="https://en.wikipedia.org/wiki/Hot_dog#/media/File:Hot_dog_with_mustard.png">Wikipedia hotdog image</a> to ollama-vision when you <em>manually</em> force the automation to run. Play with it. Change the URL around a lot. Try the not-a-hotdog thing, whatever. Get this working first. In order to do so, you’ll need 4 things open at once: ollama in powershell, Home Assistant automation editor, <strong>Home Assistant event subscribing to <code class="language-plaintext highlighter-rouge">ollama_vision_image_analyzed</code></strong> (not obvious in any docs), and the Home Assistant logs. For the Home Assistant automation you’ll want something like this for the full automation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>alias: hotdog
description: ""
triggers: []
conditions: []
actions:
  - action: ollama_vision.analyze_image
    metadata: {}
    data:
      prompt: Is this a hotdog?
      image_url: &gt;-
        https://www.w3schools.com/html/pic_trulli.jpg
      device_id: XXXXXXXXXXXXXXXXXXXXXX
      image_name: hotdog

</code></pre></div></div>
<p>Hint: that wikipedia hotdog URL doesn’t work but https://www.w3schools.com/html/pic_trulli.jpg does, and it’s neither a hot dog nor a penis.</p>

<p>Errors I hit:</p>
<ul>
  <li>Home Assistant logs showed an HTTP 403 error when fetching an image by URL (like the Wikipedia hotdog image). Solution: use a different test image URL.</li>
  <li>Formatting yaml is fuckin’ dumb so of course everything is always wrong at first. Solution: git gud.</li>
  <li>Can’t see how it all hops across from manual press -&gt; image fetch -&gt; ollama GPU stuff -&gt; notification. Solution: really seriously open all 4 of those things in one screen, 1/4 each.</li>
</ul>

<p><strong><em>Fifth:</em></strong> Once you have not-a-hotdog working, you should see something in the Home Assistant event subscription UI on <code class="language-plaintext highlighter-rouge">ollama_vision_image_analyzed</code> like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>event_type: ollama_vision_image_analyzed
data:
  integration_id: XXXXXXXXXXXXXXXXXXXXXX
  image_name: hotdog
  image_url: https://www.w3schools.com/html/pic_trulli.jpg
  prompt: Is this a hotdog?
  description: &gt;-
    No, this is not a hotdog. This is a picture of a traditional Italian village
    called Alberi, located in the region of Pugli in Italy. The village is known
    for its unique architecture and is often referred to as the "trulli"
    village. The trulli are small, round, and cone-shaped houses made of stone
    and are typically found in this region. The village is also known for its
    beautiful scenery and is a popular tourist destination.
  used_text_model: false
  text_prompt: null
  final_description: &gt;-
    No, this is not a hotdog. This is a picture of a traditional Italian village
    called Alberi, located in the region of Pugli in Italy. The village is known
    for its unique architecture and is often referred to as the "trulli"
    village. The trulli are small, round, and cone-shaped houses made of stone
    and are typically found in this region. The village is also known for its
    beautiful scenery and is a popular tourist destination.
origin: LOCAL
time_fired: "2025-12-07T05:55:21.819316+00:00"
context:
  id: XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  parent_id: null
  user_id: null
</code></pre></div></div>
<p>Here, the important information for the notification is “image_name” and “final_description”. You set “image_name” in the ollama-vision integration.</p>
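<p>If it helps to see the plumbing as plain data: the notification only ever needs two lookups out of that event. A sketch, with the payload trimmed to the interesting fields:</p>

```python
# Trimmed version of the ollama_vision_image_analyzed event above.
event = {
    "event_type": "ollama_vision_image_analyzed",
    "data": {
        "image_name": "hotdog",
        "final_description": "No, this is not a hotdog. ...",
    },
}

# The only two fields the notification cares about:
image_name = event["data"]["image_name"]
final_description = event["data"]["final_description"]
```

<p>Home Assistant templates spell the same lookups with dots, e.g. <code class="language-plaintext highlighter-rouge">trigger.event.data.final_description</code>.</p>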

<p><strong><em>Sixth:</em></strong> Make an automation which triggers on your doorbell or motion, don’t care, and takes a camera snapshot. Then send that camera snapshot over to ollama-vision.
For example, the full YAML for this trigger-&gt;ollama-vision automation is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>alias: Poke ollama vision
description: ""
triggers:
  - trigger: state
    entity_id:
      - binary_sensor.hotdog_motion
    from:
      - "off"
      - "on"
conditions: []
actions:
  - action: camera.snapshot
    metadata: {}
    data:
      filename: /config/www/snapshots/hotdog.jpg
    target:
      entity_id: camera.doorbell
  - action: notify.mobile_app_your_phone_name
    metadata: {}
    data:
      message: Motion at the hotdog
      data:
        image: /local/snapshots/hotdog.jpg
  - action: ollama_vision.analyze_image
    metadata: {}
    data:
      prompt: Is that a hotdog?
      image_url: &gt;-
        http://homeassistant.lan:8123/local/snapshots/hotdog.jpg
      device_id: xxxxxxxxxxxxxxxxxxxxx
      image_name: hotdog
mode: single
</code></pre></div></div>
<p>This does a few things:</p>
<ol>
  <li>triggers when the hotdog camera motion binary sensor switches from off to on (when there is motion detected).</li>
  <li>without condition (not a great idea)</li>
  <li>snapshot the camera to <em>a location which is AN OPEN URL FOR ANYTHING AUTHENTICATED TO HOME ASSISTANT (also not a great idea)</em></li>
  <li>Notify your phone with that image so you can refer to it later</li>
  <li>Huck it off to your gaming PC’s ollama to hopefully maybe describe it unless you’re gaming.</li>
</ol>

<p><strong><em>Seventh:</em></strong> [EDIT] This is <strong>NOT</strong> a separate automation. I learned that <code class="language-plaintext highlighter-rouge">wait_for_trigger</code> in an automation’s <code class="language-plaintext highlighter-rouge">actions</code> list can have a <code class="language-plaintext highlighter-rouge">trigger</code> type of <code class="language-plaintext highlighter-rouge">event</code>, which lets a later action reference the template value <code class="language-plaintext highlighter-rouge">wait.trigger.event</code>. The prior version of this post said to make an additional automation to catch the response. No need.</p>

<p>However, if you want to try it out and play with it, because this is all supposed to be fun to begin with, you’ll want to unpack the response event — it’s the same <code class="language-plaintext highlighter-rouge">ollama_vision_image_analyzed</code> payload shown in the <strong><em>Fifth</em></strong> step.</p>

<p>Now that you’ve seen the event structure, you can adjust your prompt and text_prompt, grab the final_description from the waited event and the image URL from the camera snapshot, and send a mobile notification. The notify action looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  - action: notify.mobile_app_derp
    data:
      message: |
        {{ wait.trigger.event.data.final_description }}
      title: Is there a hotdog at the door?
      data:
        image: /local/snapshots/{{ imgname }}
</code></pre></div></div>

<p>Here you grab the event data and put it into a notification. Use <code class="language-plaintext highlighter-rouge">wait.trigger.event.data.*</code> fields (in a prior version of this post I didn’t know about the <code class="language-plaintext highlighter-rouge">wait</code> object after <code class="language-plaintext highlighter-rouge">wait_for_trigger</code>) in conditions and in the message content/title.</p>

<p>Here’s my whole fuckin’ thing, paraphrased to ask if there is a hotdog at the door:</p>

<p>{% raw %}</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>alias: Doorbell motion -&gt; is it a HOTDOG?!
description: ""
triggers:
  - trigger: state
    entity_id:
      - binary_sensor.doorbell_motion
    from:
      - "off"
      - "on"
conditions: []
actions:
  - variables:
      imgname: doorbell_{{ now().strftime("%Y%m%d_%H%M%S") }}.jpg
  - action: camera.snapshot
    metadata: {}
    data:
      filename: /config/www/snapshots/{{ imgname }}
    target:
      entity_id: camera.doorbell
  - delay:
      seconds: 1
  - action: ollama_vision.analyze_image
    metadata: {}
    data:
      prompt: &gt;-
        Describe any hotdogs in *EXTREME* detail, otherwise describe anything out
        of the ordinary in the image. If there is nothing remarkable, then do
        shut the fuck up. Image is from a doorbell camera. Mix in if there are
        or are not any security concerns in the image (weapons, ill intent,
        police, etc) as the image comes from my doorbell camera.
      image_url: http://homeassistant.lan:8123/local/snapshots/{{ imgname }}
      device_id: fuckshit
      image_name: hotdog
      text_prompt: &gt;-
        Respond always in highly informal genz/gen-alpha slang, swear a lot, and
        be silly. Is there a goddamn hotdog or not? You 100% can roast, swear and use
        slang. Now make this description extremely terse:
        &lt;description&gt;{description}&lt;/description&gt;. This will go in a mobile phone
        notification, so 50 words **ABSOLUTE MAXIMUM**
      use_text_model: true
  - wait_for_trigger:
      - event_type: ollama_vision_image_analyzed
        trigger: event
    timeout:
      hours: 0
      minutes: 1
      seconds: 0
      milliseconds: 0
    continue_on_timeout: false
  - condition: template
    value_template: |
      {{ wait.trigger.event.data.image_name == "hotdog" }}
  - action: notify.mobile_app_derp
    data:
      message: |
        {{ wait.trigger.event.data.final_description }}
      title: Doorbell AI Roast?
      data:
        image: /local/snapshots/{{ imgname }}
mode: single
</code></pre></div></div>

<p>Tadaaa:</p>

<p><img src="/assets/home-assistant-doorbell-not-a-hotdog.png" alt="idk lol, the dude's just flipping the bird, fam. prob just mad 'bout the door or sumn. camera's just chillin', doin' its thang, ain't no security threat, bruh. just a dude bein' extra. **no hotdog**" width="800" /></p>

<h2 id="be-patient-relax-your-expectations">Be patient, relax your expectations</h2>
<p>As cool as it seems to get a phone notification with a custom-prompted vision-LLM AI bot delivering caustic or pithy quips, please remember that your wallet has been doing just fine without that for a good 100,000 fucking years.</p>

<p>Remember that modern AIs hallucinate all the time, so it’s dangerous to rely on their output for anything important like physical safety.</p>

<p>Remember that you don’t actually give a shit if it takes 2s versus 2 minutes to get a shitty AI generated joke about what’s at your door or on your camera. The point is the laugh, not the latency. Don’t pay for latency when what you want is the laugh.</p>

<p>Do you care about what an AI describes at your camera while you’re playing your game, using your GPU? No. If you do: no you don’t, shut up.</p>]]></content><author><name>Adam Gray</name></author><category term="self-hosted" /><category term="AI" /><category term="shitty-AI" /><category term="home-assistant" /><category term="fun" /><summary type="html"><![CDATA[Self-hosted low-end AI (high-end gaming GPU) to give some laughs]]></summary></entry><entry><title type="html">The Cost of 4K 60fps video storage, a dad’s perspective</title><link href="http://addumb.com/2021/02/20/cost-of-4k-60fps-video-storage-a-dads-perspective/" rel="alternate" type="text/html" title="The Cost of 4K 60fps video storage, a dad’s perspective" /><published>2021-02-20T07:00:00+00:00</published><updated>2021-02-20T07:00:00+00:00</updated><id>http://addumb.com/2021/02/20/cost-of-4k-60fps-video-storage-a-dads-perspective</id><content type="html" xml:base="http://addumb.com/2021/02/20/cost-of-4k-60fps-video-storage-a-dads-perspective/"><![CDATA[<p>I’m a dad, now! Yay! My son is awesome. Being a dad during the COVID-19 pandemic is awful in myriad ways. It has silver linings, but this post isn’t about that.</p>

<h1 id="i-ran-out-of-icloud-storage">I ran out of iCloud storage</h1>
<p>I’ve been taking 4K 60fps videos of my son doing things which I find earth-shattering: looking at me, sneezing, babbling, laughing, eating, even bathing and peeing LOL. I’m taking these in 4k 60fps on my iPhone whatever (11 pro, midnight green for sure) which offers a simple set of toggles to do so. Obviously, recording video at 4x the resolution and 2x the framerate should cost in the ballpark of 8x the storage, by napkin math. That’s the same ballpark as an order-of-magnitude increase in storage. It isn’t quite, but my napkin-math instinct should have warned me.</p>

<p>These toggles are, to my new dad perspective, future proofing my memories. Back in the days before the pandemic, people would have high school graduation parties. At my own, my parents lovingly trawled through the family archives to unearth the most dubiously endearing but truthfully embarrassing photos and videos they could find. Great success on their part, if memory serves me well.</p>

<p>I want a trove of ancient videos to share with the world when my son hits these coming of age milestones. I want to show him how I’m such an idiot that I sacrificed future flexibility on my part, fitting-in better on his part, and generally not being embarrassed on all parts. I’m already so proud of him, I cannot imagine how proud I’ll be if he has one of these shindigs to celebrate a major achievement among his peers. Him taking a crap is a major achievement in my book, so I’m just gonna be constantly gushing about how awesome he is. I have to keep that discount for myself.</p>

<p>So, I ran out of iCloud storage, of course. Who tf doesn’t? That’s the entire point of iCloud storage: to run out and pay apple for more “backups.” This got me thinking, how am I gonna save these 4k 60fps videos? How am I going to archive them and then retrieve them at a later point? Well, let’s start with Apple’s estimates of storage costs of these videos. The iPhone settings app has estimates listed under the camera details:</p>

<blockquote>
  <p>A minute of video will be approximately:</p>
  <ul>
    <li>40 MB with 720p HD [editorial: LOL @ HD] at 30 fps (space saver)</li>
    <li>60 MB with 1080p HD at 30 fps (default)</li>
    <li>90 MB with 1080p HD at 60 fps (smoother)</li>
    <li>135 MB with 4k [editorial: why no “UHD”?] at 24 fps (film style) 🙄</li>
    <li>170 MB with 4k at 30 fps (higher resolution)</li>
    <li>400 MB with 4k at 60 fps (higher resolution, smaller)</li>
  </ul>
</blockquote>

<p>1080p @30fps is 60MB/minute or 1MB/s. 4k @60fps is 400MB per minute, or 6.66667whatever MB/s. Let’s invert it so we don’t have repeating decimals, they’re annoying in text: 1080p @30fps is 0.15 times the size of 4k @60fps. Cool, my 8x guess (0.125, inverted) was pretty close! Let’s just blame the difference on a rounding compromise at Apple to keep the camera settings nice and neat, though imprecise. We’re still gonna use their numbers for now, though. My napkin math is supported by only one scientist.</p>
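<p>The napkin check as actual arithmetic, using Apple’s per-minute numbers from above:</p>

```python
# Apple's estimates from the camera settings screen, in MB per minute.
rates_mb_per_min = {
    "720p30": 40,
    "1080p30": 60,
    "1080p60": 90,
    "4k24": 135,
    "4k30": 170,
    "4k60": 400,
}

# Apple's implied ratio of 1080p30 to 4k60 storage...
apple_ratio = rates_mb_per_min["1080p30"] / rates_mb_per_min["4k60"]  # 0.15
# ...vs the napkin guess: 4x the pixels * 2x the frames -> 8x, inverted.
napkin_ratio = 1 / (4 * 2)  # 0.125
```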

<h1 id="how-much-video-do-i-take">How much video do I take?</h1>
<p>Well this is highly variable, but if there’s one thing I learned while getting my Physics degree, it’s that you can approximate the shape of a cow as roughly a sphere. So let’s approximate. Let’s see how much I’ve taken at 4k 60fps so far:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ls -l /mnt/lilstorage/Videos/Other/iPhone/2021-02-20/|wc -l
310
</code></pre></div></div>
<p>Ok, 310 videos. Let’s see how to get their duration. (googling…) Looks like <code class="language-plaintext highlighter-rouge">mediainfo</code> will do it, so <code class="language-plaintext highlighter-rouge">sudo apt-get install mediainfo</code> or whatever, then away we go:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ for f in *; do mediainfo "$f" | grep ^Duration | head -1; done
... snip
Duration                                 : 1 min 4 s
Duration                                 : 50 s 277 ms
... snip
</code></pre></div></div>
<p>Great, the tool unhelpfully assumes that I, a human, am not a computer. Turns out humans can compute, so that’s annoying. Oh, but it supports * args and has a JSON output option:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mediainfo --Output=JSON * | jq .
... snip uhhhh on second thought, you don't want to see this
</code></pre></div></div>
<p>Ok, so it has some structure. For each file, it outputs 3 “tracks”: general, video, and audio. But it already looks a bit askew. I need to compare a 30fps and 60fps file, a 4k and a 1080p file, and maybe even the full cross to understand this tool. One 4k 60fps file I have is named IMG_4165.MOV, and a 1080p 30fps file is IMG_3603.MOV. Let’s compare their outputs:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mediainfo --Output=JSON IMG_4165.MOV IMG_3603.MOV | jq .
... snip 
FrameRate
Duration
Height
Width
... snip
</code></pre></div></div>
<p>Yeah, I crapped out, but that’s the point at least: it’s getting a bit more complicated just to tally up the duration of 4k vs 1080p videos I’ve taken. I’ll plop this out to a file and read it into ipython:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mediainfo --Output=JSON * | jq '.[].media.track[1] | "\(.Width) \(.Height) \(.FrameRate) \(.Duration)"' | sed 's/"//g' &gt; ~/derp.txt
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">jq</code> cares about quotes, I don’t.</p>
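<p>If the quote wrestling gets old, the same extraction is a few lines of stdlib Python. A sketch — the JSON here is a synthetic two-file sample shaped like mediainfo’s output, not my real data:</p>

```python
import json

# Synthetic stand-in shaped like `mediainfo --Output=JSON file1 file2`
# output (real output has many more fields per track).
raw = json.dumps([
    {"media": {"track": [
        {"@type": "General"},
        {"@type": "Video", "Width": "3840", "Height": "2160",
         "FrameRate": "59.940", "Duration": "64.0"},
        {"@type": "Audio"},
    ]}},
    {"media": {"track": [
        {"@type": "General"},
        {"@type": "Video", "Width": "1920", "Height": "1080",
         "FrameRate": "29.970", "Duration": "50.277"},
        {"@type": "Audio"},
    ]}},
])

rows = []
for entry in json.loads(raw):
    video = entry["media"]["track"][1]  # same track[1] the jq filter grabs
    rows.append((video["Width"], video["Height"],
                 video["FrameRate"], video["Duration"]))
```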

<p>Then let’s load it up and group things together:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python3
</span><span class="n">derp</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">"derp.txt"</span><span class="p">,</span> <span class="s">"r"</span><span class="p">).</span><span class="n">readlines</span><span class="p">()</span>

<span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">namedtuple</span><span class="p">,</span> <span class="n">defaultdict</span>

<span class="n">vidfile</span> <span class="o">=</span> <span class="n">namedtuple</span><span class="p">(</span><span class="s">"vidfile"</span><span class="p">,</span> <span class="s">"width height framerate duration"</span><span class="p">)</span>
<span class="n">vidfiles</span> <span class="o">=</span> <span class="p">[</span><span class="n">vidfile</span><span class="p">(</span><span class="o">*</span><span class="n">l</span><span class="p">.</span><span class="n">strip</span><span class="p">().</span><span class="n">split</span><span class="p">(</span><span class="s">" "</span><span class="p">))</span> <span class="k">for</span> <span class="n">l</span> <span class="ow">in</span> <span class="n">derp</span><span class="p">]</span>


<span class="k">def</span> <span class="nf">framebucket</span><span class="p">(</span><span class="n">fps</span><span class="p">):</span>
    <span class="n">fps</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">fps</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">fps</span> <span class="o">&lt;</span> <span class="mi">20</span><span class="p">:</span>
        <span class="k">return</span> <span class="s">"timelapse"</span>
    <span class="k">elif</span> <span class="n">fps</span> <span class="o">&lt;</span> <span class="mi">26</span><span class="p">:</span>
        <span class="k">return</span> <span class="s">"24"</span>
    <span class="k">elif</span> <span class="n">fps</span> <span class="o">&lt;</span> <span class="mi">35</span><span class="p">:</span>
        <span class="k">return</span> <span class="s">"30"</span>
    <span class="k">elif</span> <span class="n">fps</span> <span class="o">&lt;</span> <span class="mi">65</span><span class="p">:</span>
        <span class="k">return</span> <span class="s">"60"</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="s">"slowmo"</span>


<span class="n">vid</span> <span class="o">=</span> <span class="n">namedtuple</span><span class="p">(</span><span class="s">"vid"</span><span class="p">,</span> <span class="s">"res framebucket duration"</span><span class="p">)</span>
<span class="n">vids</span> <span class="o">=</span> <span class="p">[</span>
    <span class="n">vid</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">v</span><span class="p">.</span><span class="n">width</span><span class="p">)</span> <span class="o">*</span> <span class="nb">int</span><span class="p">(</span><span class="n">v</span><span class="p">.</span><span class="n">height</span><span class="p">),</span> <span class="n">framebucket</span><span class="p">(</span><span class="n">v</span><span class="p">.</span><span class="n">framerate</span><span class="p">),</span> <span class="n">v</span><span class="p">.</span><span class="n">duration</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">vidfiles</span>
<span class="p">]</span>

<span class="n">summary</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">vids</span><span class="p">:</span>
    <span class="n">summary</span><span class="p">[</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">v</span><span class="p">.</span><span class="n">res</span><span class="si">}</span><span class="s"> @ </span><span class="si">{</span><span class="n">v</span><span class="p">.</span><span class="n">framebucket</span><span class="si">}</span><span class="s">"</span><span class="p">]</span> <span class="o">+=</span> <span class="nb">float</span><span class="p">(</span><span class="n">v</span><span class="p">.</span><span class="n">duration</span><span class="p">)</span>

<span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">summary</span><span class="p">.</span><span class="n">items</span><span class="p">()):</span>
    <span class="k">print</span><span class="p">(</span><span class="n">k</span><span class="p">,</span> <span class="n">v</span><span class="p">)</span>

</code></pre></div></div>
<p>The output shows that I shoot roughly 5x more 1080p @ 30fps video than 4k @ 60fps video:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1701120 @ 24 15.255
1701120 @ 60 42.568999999999996
2073600 @ 24 896.58
2073600 @ 30 10020.490000000009
2073600 @ 60 313.47099999999995
2073600 @ slowmo 18.997
2764800 @ 30 1.8079999999999998
8294400 @ 30 233.902
8294400 @ 60 2065.2369999999996
921600 @ 30 632.647
</code></pre></div></div>
<p>Ignore the floating-point rounding noise; that’s 10020.5 seconds of 1080p @ 30fps and 2065.237 seconds of 4k @ 60fps, respectively.</p>

<h1 id="how-much-video-will-i-take">How much video <em>will</em> I take?</h1>
<p>I confess: I take a bunch of very short videos of specific interactions. I’m trying to take longer videos, but it’s hard to do that while the subject tries their best to get into trouble.</p>

<p>Well, my son is 9 months old (minus 3 days) and I’ve taken lots. My phone broke and I had to swap phones when he was only 1 month old, so these files cover the 8 months since. That’s 10020 seconds of 1080p 30fps video and 2065 seconds of 4k 60fps video in 8 months, or on rough average: 0.347917 hours per month of 1080p 30fps and only 0.07170138 hours per month of 4k 60fps.</p>

<p>If we pretend I keep my habit as-is, by the time he’s 18, I’ll have 75 hours of ancient 1080p 30fps video and a measly 15.5 hours of less-ancient 4k 60fps video. A decent pool to choose from for a simple carousel-style loop.</p>
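<p>That projection is just the measured totals scaled from 8 months up to 216; a quick Python sketch (numbers copied from the duration summary above) reproduces it:</p>

```python
# Measured totals from the duration summary above, in seconds.
sec_1080p30 = 10020.49
sec_4k60 = 2065.237

months_measured = 8      # the files cover ages 1 month through 9 months
months_to_18 = 18 * 12   # assume the habit holds until he's 18

hours_1080p30 = sec_1080p30 / months_measured * months_to_18 / 3600
hours_4k60 = sec_4k60 / months_measured * months_to_18 / 3600

print(f"{hours_1080p30:.1f} hours of 1080p 30fps")  # 75.2 hours
print(f"{hours_4k60:.1f} hours of 4k 60fps")        # 15.5 hours
```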

<p><em>Phew</em>! I was worried I’d need to factor in how 1080p -&gt; 4k -&gt; 8k etc progression will happen and estimate filesize growth over time.</p>

<p>By Apple’s estimates, my 15.5 hours of 4k 60fps video will take only about 360GB to store. Easy peasy!</p>

<h1 id="technology-progression-costs-more-storage">Technology progression costs more storage!</h1>
<p>But wait, I do need to estimate the progression! I don’t have the energy to estimate, so I’ll over-estimate by supposing my entire archive is best approximated as 400MB per minute (nullifying the fun I just had): 2TB. Big whoop. I guess I won’t get that NAS and will just shove them into S3 Glacier for now.</p>]]></content><author><name>Adam Gray</name></author><category term="Tech" /><category term="Tech" /><summary type="html"><![CDATA[Recording 4k 60fps iPhone videos costs nearly nothing for an individual]]></summary></entry><entry><title type="html">Mobile App Operability</title><link href="http://addumb.com/2020/07/15/mobile-app-operability/" rel="alternate" type="text/html" title="Mobile App Operability" /><published>2020-07-15T07:00:00+00:00</published><updated>2020-07-15T07:00:00+00:00</updated><id>http://addumb.com/2020/07/15/mobile-app-operability</id><content type="html" xml:base="http://addumb.com/2020/07/15/mobile-app-operability/"><![CDATA[<p>Mobile apps often get a bad rep from backend infrastructure people. Rightly so when comparing operability between backend infrastructure services and mobile applications. First, so that we’re on the same page: operability is a trait of a software system which makes detection, remediation, and anticipation of errors low-effort. What is this for mobile applications? Well… it’s the trait of a mobile application which makes detection, remediation, and anticipation of errors low-effort, of course. Mobile apps are software systems. They’re usually best modeled as trivially parallelized distributed software systems which just happen to run on mobile devices rather than on owned infrastructure.</p>

<p><strong><em>Mobile operability is the trait of mobile apps which makes detection, diagnosis, remediation, and anticipation of errors low-cost and low-effort.</em></strong></p>

<p>Errors here can be anything unwanted: crashes, freezes, error messages, lacking correct responses to interactions, and even having incorrect responses to interactions.</p>

<h1 id="detection">Detection</h1>

<p>To consider what good operability is for mobile apps, we need a way to detect these errors. To detect errors, we have to record them and deliver that record to a detection system. That system does not have to be outside of the app! In fact, it’s best if your app has a built-in self-diagnostic mechanism to help understand errors in a way that respects your users’ privacy. Common detection mechanisms are Crashlytics, the Google Play console Vitals reports, and app reviews and star ratings. The app usage analytics system should also be used for collecting errors, though you should use a separate analytics aggregation/processing method here to ensure your error data carries as few user attributes as possible. Why not use the same one? Well, with normal app analytics, you’re subject to GDPR, deletion requests, and heightened security expectations. Errors shouldn’t be like this; they should be pared down so far that you can safely and securely retain them forever.</p>

<p>Detection is focused on efficiently detecting occurrences of errors, not making them easy to debug.</p>

<p>Each error category may have different, and multiple, avenues of delivery. Some likely have none at all, which is why I’m writing this. There are three general mechanisms for delivering errors:</p>

<ol>
  <li>
    <p>3rd party: things which you don’t own, manage, or control, like the Google Play console Vitals, app store reviews, star ratings, etc.</p>
  </li>
  <li>
    <p>Out-of-band: your app aggregates, interprets, and potentially sends error details to a separate backend system than what your app primarily interacts with. Common examples are analytics systems, custom crash interpreters, and local error aggregation.</p>
  </li>
  <li>
    <p>In-band: your app includes some details of errors within the primary client/server RPC; your server may include some global error information within the primary RPCs as well.</p>
  </li>
</ol>
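<p>A tiny sketch of the out-of-band flavor (Python for brevity; the field names and schema are made up, not a real SDK). The key property is that the record is pared down and fingerprinted before it ever leaves the device, so no user attributes ride along:</p>

```python
import hashlib
import json

def build_error_report(error_type, stack_trace, device_model, app_version):
    """Build an aggregable, user-free error record (hypothetical schema)."""
    return {
        "error_type": error_type,
        # Hash the trace so identical crashes bucket together server-side
        # without shipping the raw trace on every occurrence.
        "trace_fingerprint": hashlib.sha256(stack_trace.encode()).hexdigest()[:16],
        "device_model": device_model,
        "app_version": app_version,
        # Deliberately no user id, session id, or precise timestamp:
        # pared down far enough to safely retain forever.
    }

report = build_error_report(
    "NullPointerException",
    "at com.example.Feed.render(Feed.kt:42)",
    "Pixel 8",
    "3.14.1",
)
print(json.dumps(report))
```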

<h1 id="diagnosis">Diagnosis</h1>

<p>There is a middle ground between zero context and debug-level context. You need to know the stack trace, view hierarchy, and whatever information about the device and user stands out to help diagnose a problem. You can quickly detect that a crash happened, but diagnosing why may require far more insight. The middle ground is in the aggregable context of an error: stack trace, activity or view, device type, etc. You should not require a full core dump for understanding that the app crashed; that’s for understanding why.</p>

<p>Diagnosis requires great context for the error. This context can come from a few different places:</p>

<ol>
  <li>
    <p>The app itself: it can submit an error report to you.</p>
  </li>
  <li>
    <p>The backend(s): maybe your web servers sent an HTTP 500 response which caused the user-visible error.</p>
  </li>
  <li>
    <p>The session, user, or customer: if you can determine which account experienced a specific error, you may be able to reconstruct context from their user data.</p>
  </li>
</ol>

<p>Ever-increasing context is not the path to perfect diagnosis; there should always be a trade-off between increasing context and value to the user. If you want to collect full core dumps, remember that transmitting 4GB+ of data over a cell network can bankrupt some people, so don’t do it. Adding the device manufacturer or even device model to the error report may give you critical context for some problems.</p>

<p>Every error may be different. There’s no magical set of fields to supply in an error to give sufficient context to efficiently diagnose. You, as the app developer, should have the best idea of what pieces of context will be most useful in each subsystem of your app. For example, if you’re implementing the password change flow, you may want to include metadata about the account’s password: how old was it, how many times has this person changed it recently, or even how they arrived at the screen if you have multiple paths like an email deep link.</p>
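<p>As a concrete (and entirely hypothetical) sketch of that password-change example, the per-subsystem context might be as small as:</p>

```python
from dataclasses import dataclass, asdict

@dataclass
class PasswordChangeErrorContext:
    """Context attached to errors in the password-change flow (illustrative)."""
    password_age_days: int   # how old the current password is
    recent_changes: int      # changes in, say, the last 90 days
    entry_path: str          # "settings", "email_deep_link", ...

ctx = PasswordChangeErrorContext(
    password_age_days=400, recent_changes=3, entry_path="email_deep_link"
)
print(asdict(ctx))
```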

<h1 id="anticipation">Anticipation</h1>

<p>There are many scenarios where a quantity in the application may be increasing or decreasing toward a critical threshold: resident set size, battery power, CPU cycles, cloud storage quota, etc. There are a few strategies to know if these are about to produce an error:</p>

<ul>
  <li>
    <p>High/low water marks</p>
  </li>
  <li>
    <p>Quotas</p>
  </li>
  <li>
    <p>Correlational estimates: flow complexity, runtime, user sophistication, and others I have yet to learn or hear about.</p>
  </li>
</ul>

<p>High/low watermarks: if your application has a finite amount of a resource, you should implement it so that it operates optimally under a low amount of that resource. Once it reaches a threshold of resource usage, it can start disabling or degrading functionality. It’s common to implement at least two thresholds here: high and low watermarks. If the low watermark is reached, the app can start to refuse new allocations into the resource pool, or delay unnecessary work. The app should still generally function as expected, though maybe a bit slower. Upon reaching the high watermark, the app should outright disable functionality and evict low-priority resource usage.</p>

<p>Example: Let’s say an app takes, uploads, and views images. One of the resources it’s sure to use is local disk: to cache captured images, cache or even synchronize downloaded images, and to cache different image resolutions or image effect applications. If we set a low watermark of 1GB local storage, the app can switch to a mode where it does all processing on only previously allocated image files. This prevents some increase in storage used. However, if we hit a high watermark, we could have the app actively de-allocate images in storage and force re-fetching, in-memory processing, or even decreased resolution or effect support.</p>
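<p>The two-threshold logic is simple enough to sketch (Python for brevity; the thresholds and mode names are illustrative, not a real API):</p>

```python
LOW_WATERMARK = 1 * 1024**3    # 1 GB used: stop growing the cache
HIGH_WATERMARK = 2 * 1024**3   # 2 GB used: actively evict

def cache_mode(bytes_used):
    """Pick an image-cache degradation mode from current disk usage."""
    if bytes_used >= HIGH_WATERMARK:
        # De-allocate images, force re-fetching, drop resolution support.
        return "evict"
    if bytes_used >= LOW_WATERMARK:
        # Only process previously allocated files; refuse new allocations.
        return "frozen"
    return "normal"

print(cache_mode(500 * 1024**2))   # normal
print(cache_mode(1536 * 1024**2))  # frozen
print(cache_mode(3 * 1024**3))     # evict
```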

<p>Quotas: these follow a similar concept as watermarks, but they have an inverse usage expectation. Nothing should change in behavior until the quota is met, at which point there can be a user flow for increasing the quota or decreasing the usage.</p>

<p>Correlational estimates: try not to use these. If a critical error occurs, once the dust settles and everything is recovered, it’s common to wonder if you could have seen it coming. One common question in the blameless post-mortem is “Was there a leading indicator which could have predicted this issue?” This is grasping at straws. If your application is so complex that there is no direct indicator you can add, nor bug you can fix, you’re already in too deep without proper operability. Push back against adding an alert on a correlational leading indicator. Instead, try to reduce the complexity of the app around the failure.</p>

<h1 id="remediation">Remediation</h1>

<p>If we’ve detected and diagnosed an error, how do we fix it? This is where things get fun :) The general idea is to use everything at your disposal! The well-trod areas are:</p>

<ul>
  <li>
    <p>Backend remediation</p>
  </li>
  <li>
    <p>In-app feature flags</p>
  </li>
  <li>
    <p>Hotfix</p>
  </li>
</ul>

<p>I’d like to urge you to consider another: a failover app wrapper. The failover wrapper is not just a blanket exception handler; it’s as separated as possible from the icky, risky, and most problematic part of your app: your code (mine included!). There are many ways to go about this and none of them are sufficient, otherwise it would be a fail-safe. Nothing in your application’s logic should be able to get into such a bad state that the best recourse available is to harm the core user experience. You can implement this as a stale view of data in a separate process, or as a babysitter process which spawns your main application’s surfaces, but the best available options do use a separate process. Within your main application’s logic, you can have a mechanism of “crashing” which only exits the child process, not the wrapper, which is still able to provide some useful functionality to the user. The IPC between the processes to synchronize the user’s state could be done all sorts of ways, but again needs to be overly simple so that the failover wrapper doesn’t collect the same bugs as the main app.</p>

<p>Backend remediation is what we end up with when we don’t plan for emergencies as we write the app. The backend sends some special configuration telling the app to disable something, or sends a response which was carefully crafted to side-step the problematic app code. You’ll end up with some of these, and that’s a good thing.</p>

<p>In-app feature flags deserve a whole separate post. I’ll also split them out into two categories: killswitches and dials. The general idea here is to implement graceful degradation within the app. Similar to the idea above about high/low watermarks, if the app encounters some critical error, the app can have a mechanism to reduce or remove functionality. One simple example is to crash when detecting a privacy error. Rather than show any activity with a privacy problem, just exit the app. This is the bare minimum, though, because how can you fix the privacy problem and stop the app from crashing?</p>

<p>Killswitches: these are simple boolean indicators which the app checks any time it exercises some specific functionality. The functionality is skipped if the killswitch is on. Your implementation may vary in the use of true vs. false to indicate on or off, but make it consistent. These parameters should be delivered through an out-of-band RPC to configure your application, ideally changing while the app is running. Distributing killswitches is a fun distributed systems problem with many complex solutions, just be sure that you’re comfortable with the coverage and timeframe of your solution, be it polling, pushing, swarming, or other fancier options I need to learn about.</p>

<p>Warning about killswitches: if you have killswitches for very small pieces of your application, you will end up exercising code which is behind tens or hundreds of conditional killswitch checks. Nobody ever tests every combination of killswitches, because they are killswitches. Use A/B testing to control the on/off choice of these smaller, less disaster-prone surfaces of your application. Killswitches should be reserved for disabling large pieces of functionality.</p>
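<p>In code, the check itself should be deliberately boring. A sketch (the config shape and flag names are hypothetical; note a missing flag fails open):</p>

```python
# Hypothetical killswitch state, delivered by an out-of-band config RPC
# and refreshed while the app runs.
remote_config = {
    "killswitch.comments": True,      # True == feature is killed
    "killswitch.video_upload": False,
}

def is_killed(feature, config):
    # A flag the client has never heard of defaults to "not killed".
    return config.get(f"killswitch.{feature}", False)

def render_comments(config):
    if is_killed("comments", config):
        return "comments hidden"   # degrade gracefully instead of crashing
    return "comments shown"

print(render_comments(remote_config))  # comments hidden
```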

<p>Dials: some functionality in your application can scale up and down in demand or resource usage, both on the client and on the backend. In the example of watermarks above, image resolution could be a dial from the client to reduce resource usage on the client or on the server. Video resolution, time between polls, push batching, and staleness of data are all examples of dials which you can use to remediate huge categories of errors if you support dialing to 0.</p>

<p>Hotfix: you can always ship a new build of your app with the offending bug fixed. This takes a long time and may take a whole team of people to produce. Even then, you’re at the mercy of the app stores, which may take days or even <em>weeks</em> to finally push your update. Just be sure only to do a true hotfix: fix the code from the very commit used to build the affected version.</p>

<h1 id="the-perfect-app">The Perfect App</h1>

<p>Remember: “The perfect is the enemy of the good.” Having a product is more important than having a perfectly operable product. The flip-side also holds: having an operable product is more important than having a perfect product. Don’t go all-in on either, exercise the balance with moderation. Learn from your mistakes. Just as a competitor is about to eat your lunch by getting to market sooner, another competitor may swoop in when you inevitably have a disaster and don’t have the operational capability to recover.</p>]]></content><author><name>Adam Gray</name></author><category term="Tech" /><category term="Tech" /><summary type="html"><![CDATA[Mobile app operability is the trait of mobile apps which makes detection, diagnosis, remediation, and anticipation of errors low-cost and low-effort.]]></summary></entry><entry><title type="html">I don’t write</title><link href="http://addumb.com/2019/10/30/i-dont-write/" rel="alternate" type="text/html" title="I don’t write" /><published>2019-10-30T06:19:00+00:00</published><updated>2019-10-30T06:19:00+00:00</updated><id>http://addumb.com/2019/10/30/i-dont-write</id><content type="html" xml:base="http://addumb.com/2019/10/30/i-dont-write/"><![CDATA[<p>I don’t write, and I often get feedback that it’s hurting my career in software engineering.</p>

<p>Disclaimer: I’m butthurt about struggling to advance in my career.</p>

<p>The following is what happens when I try to write anything (this included):</p>

<ul>
  <li>I think I have a good idea</li>
  <li>I flesh it out into fairly broad strokes</li>
  <li>I start writing</li>
  <li>I doubt my idea</li>
  <li>I fear being bullied</li>
  <li>I delete what I wrote</li>
</ul>

<p><strong>Why does this hurt my career? Because I’m a software engineer.</strong> We conflate visibility and productivity in this industry, thinking the engineer who writes Medium articles all the time must know their shit because they seem to have quite a following.</p>

<p><strong>I’m not saying I shouldn’t have to broadly communicate.</strong> There’s obviously value in broadly communicating. It’s why the fucking printing press was a world-changing invention. I’m not talking about spreading the Good News here, though. I’m talking about rehashing architecture choices, workflow recommendations, production operations plans, and shit as banal as unit test coverage. Why would I bore the world or my peers with things that are trivial to search for? How could that possibly be valuable? How is that required for me to “level up”?</p>

<p>I’d like to explore why I don’t write, particularly for work. So let’s look at my steps of abandonment.</p>

<h3 id="how-to-tank-any-technical-career-by-not-communicating-enough">How to Tank Any Technical Career by “Not Communicating Enough”</h3>

<blockquote>
  <p>I think I have a good idea</p>
</blockquote>

<p>We all have good ideas. We’re humans. Each of us has a unique experience and can share ideas which others haven’t had. Stupid simple. I don’t have confidence to call many of my ideas good, but I figure some probably are good.</p>

<blockquote>
  <p>I flesh it out into fairly broad strokes</p>
</blockquote>

<p>This is actually a pretty quick process: trim it down to the bare essentials, don’t demean anybody’s work, have a strong recommendation or request of the coworker who reads it. Easy peasy.</p>

<blockquote>
  <p>I start writing</p>
</blockquote>

<p>For larger documents or plans, where more context and guiding is needed, I go off the rails pretty easily. I get bogged down in minutiae. I start trying to cover every angle of technical attack this thing may face.</p>

<blockquote>
  <p>I doubt my idea</p>
</blockquote>

<p>As I’m writing a defense without an attack, I start to convince myself that it’s not worth writing. That it contradicts something else I hadn’t seen. That it’s obvious, not novel. That I’m missing some piece of context that makes the whole thing moot.</p>

<blockquote>
  <p>I fear being bullied</p>
</blockquote>

<p>I can hear it now: “well you probably just can’t take feedback very well if you always think it’s bullying.” Sure, fine. There’s nothing I can say to that. It’s just shutting down a conversation. If that’s what you’re thinking, then this isn’t for you. Otherwise, you may have experienced this. The Super Sr. Hon. Principled Engineer XIV stepping into the idea you had and shitting all over it. Not because they have a better idea, but because they can shoot it down without bothering to consider it. Because one of the key words I used matched a thing they own. Because I recommended using a system they didn’t write. Because they can.</p>

<blockquote>
  <p>I delete what I wrote</p>
</blockquote>

<p>Not this time!</p>

<p>That’s all, I mostly needed to vent. But in the spirit of not deleting what I wrote, here are snippets from earlier versions of what I thought I wanted to say…</p>

<hr />

<p>I’m talking about advancing a career through sheer confidence. Faking it until you make it seems to be a requirement in software engineering career progression.</p>

<p>I do this personally, for nerd stuff or regular-ass personal stuff. I have a handful of partial drafted things I’ve written out. One weird thing about the personal ones is the audience. The audience is always me. I can’t write advice since I’ll just doubt the value of it. I mean hey, it didn’t help me the first time around, right? So I write to straighten my thoughts, to categorize and reinterpret.</p>

<p>I do this abandonment process at work almost every day. Let’s say I have an idea, or a direction shift, which I need a lot of people to get on-board with before the wheels come off of the product/app/website/project/whatever. So, I say it in a chat and a couple people agree, then I say it more broadly and get complete and utter silence. I could bring it up in a meeting, but I usually talk myself down from that by pointing out what happens every time I do: people say that’s a good idea, pretend it takes 10x the engineering time it does, and proceed to politely tell me to go fuck myself by saying that this next deadline is more important.</p>

<p>This has trained me not to share my ideas widely. It has taught me that my ideas are bad.</p>

<p>So I instead have 1:1 or small group conversations about technical direction, driving everybody to a higher bar and generally getting people excited to work on the most boring stuff possible: reliability. I do this a lot at work, pointing out subtle tactical changes and creating a new vision of whatever it is I’m working with.</p>]]></content><author><name>Adam Gray</name></author><category term="Tech" /><category term="Tech" /><summary type="html"><![CDATA[I don't write, and I often get feedback that it's hurting my career in software engineering.]]></summary></entry><entry><title type="html">I moved addumb.com into GitHub pages</title><link href="http://addumb.com/2016/03/07/i-moved-addumb-com-into-github-pages/" rel="alternate" type="text/html" title="I moved addumb.com into GitHub pages" /><published>2016-03-07T18:26:00+00:00</published><updated>2016-03-07T18:26:00+00:00</updated><id>http://addumb.com/2016/03/07/i-moved-addumb-com-into-github-pages</id><content type="html" xml:base="http://addumb.com/2016/03/07/i-moved-addumb-com-into-github-pages/"><![CDATA[<p>It was kind of fun :) Believe it or not, I actually have a couple other pages under addumb.com aside from shitty blog posts:</p>

<ul>
  <li><a href="/posts">/posts</a></li>
  <li><a href="/resume.html">my résumé</a></li>
  <li><a href="/résumé.htm">my résumé</a></li>
  <li><a href="/derp.html">some social science statistic calculator</a></li>
  <li><a href="/nato-phonetic-alphabet-tool.html">a NATO phonetic alphabet spitter-outer</a></li>
</ul>

<p>Those were all really easy to move. Moving my shitty blog from wordpress.com into GitHub pages was a little more complicated than all the guides make it seem. At a high level, you’ll do these three big steps:</p>

<ul>
  <li>Do the normal shit for <a href="https://pages.github.com/">GitHub pages</a> (make a new repo, setup a jekyll site)</li>
  <li>Import your Wordpress blog.</li>
  <li>Fix all the stuff that just broke all over the place</li>
</ul>

<h2 id="github-pages">GitHub Pages</h2>
<p>The <a href="https://help.github.com/categories/github-pages-basics/">GitHub docs</a> on this are pretty straightforward. You should end up with a site with a directory structure like this:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>tree <span class="nt">-L</span> 2
<span class="nb">.</span>
├── CNAME
├── Gemfile
├── README.md
├── _config.yml
├── _includes
│   ├── footer.html
│   ├── head.html
│   ├── header.html
│   └── sidebar.html
├── _layouts
│   └── default.html
├── derp.html
├── index.md
├── posts.md
├── sitemap.xml
└── vendor
    └── bundle
</code></pre></div></div>
<p>The contents of CNAME here is just “addumb.com” for me. Don’t do that for yours. Gemfile only contains 2 lines right now: “source ‘https://rubygems.org’” and “gem ‘github-pages’”. README.md is just a regular GitHub readme. <code class="language-plaintext highlighter-rouge">_config.yml</code> has some annoying pieces you may need to dig up, particularly this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>markdown: kramdown
kramdown:
  input: GFM
  force_wrap: false
</code></pre></div></div>
<p>Then the <code class="language-plaintext highlighter-rouge">_includes</code> are just your site template pieces, <code class="language-plaintext highlighter-rouge">_layouts</code> just has a default for now which is basically empty:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;!DOCTYPE html&gt;
&lt;html&gt;
{% include head.html %}
&lt;body&gt;
  {% include header.html %}
  {{ content }}
  {% include footer.html %}
&lt;/body&gt;
&lt;/html&gt;
</code></pre></div></div>

<p>You should be able to run this to get your jekyll site running locally:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ bundle exec jekyll serve
</code></pre></div></div>

<h2 id="the-wordpresscom-import-stuff">The Wordpress.com Import Stuff</h2>
<p>Now that you have a fancy new GitHub Pages site up and running, it’s time to move your shitty blog over to it from Wordpress.com. The steps are very similar for other Wordpress setups, so try to read between the lines.</p>

<p><span class="label label-warning">Warning</span> Make a new branch in your repo, you’re going to totally fuck shit up.</p>

<p>Okay, then follow along here…</p>

<ol>
  <li>Login to your wordpress.com account, then your “site”, then export it all as a single XML file.</li>
  <li>
    <p>Add the <code class="language-plaintext highlighter-rouge">jekyll-import</code> dependency to your <code class="language-plaintext highlighter-rouge">Gemfile</code> where you should already have <code class="language-plaintext highlighter-rouge">github-pages</code>:</p>

    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>source 'https://rubygems.org'
gem 'jekyll-import'
gem 'github-pages'
</code></pre></div>    </div>
    <p>Don’t try to add this at the outset. Install github-pages, and then jekyll-import. They have a fake version conflict.</p>
  </li>
  <li>
    <p>Now run this to get your import started:</p>

    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bundle exec jekyll import wordpressdotcom --source-file ~/Downloads/addumb.wordpress*.xml
</code></pre></div>    </div>
    <p>This will spew all kinds of unhelpful gem dependency failures for a while. One by one, you’ll need to fix them. This step is total bullshit.</p>

    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Whoops! Looks like you need to install 'hpricot' before you can use this importer.

If you're using bundler:
  1. Add 'gem "hpricot"' to your Gemfile
  2. Run 'bundle install'

If you're not using bundler:
  1. Run 'gem install hpricot'.
</code></pre></div>    </div>
    <p>This is pretty straightforward:</p>

    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo "gem 'hpricot'" &gt;&gt; Gemfile
bundle install
bundle exec jekyll import wordpressdotcom --source-file ~/Downloads/addumb.wordpress*.xml
</code></pre></div>    </div>
    <p>and repeat until you don’t get a Gem error.</p>
  </li>
  <li>
    <p>Okay, now you have a steaming pile of malformatted “.html” files under <code class="language-plaintext highlighter-rouge">_posts</code>. Each one has “frontmatter” at the top of the page, which is put between YAML comment markers: <code class="language-plaintext highlighter-rouge">---</code>. The frontmatter is just YAML describing some specifics about each post. Go through each one and clean up the garbage left over, when you’re done it should look something like this post’s frontmatter:</p>

    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ---
 layout: post
 title: I moved addumb.com into GitHub pages
 date: 2016-03-07 11:26:00.000000000 -07:00
 type: post
 published: true
 status: publish
 description: Moving a blog from wordpress and website from AWS to GitHub Pages.
 keywords: github pages, aws migration, wordpress export
 categories:
 - Tech
 - Tip
 tags:
 - Tech
 author:
   login: addumb
   email: adam@addumb.com
   display_name: addumb
   first_name: 'Adam'
   last_name: 'Gray'
 ---
</code></pre></div>    </div>

    <p>You should have removed garbage like this:</p>

    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> meta:
 _publicize_pending: '1'
 _edit_last: '16162427'
 _wp_old_slug: '1'
 original_post_id: '1'
</code></pre></div>    </div>
  </li>
</ol>
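<p>If you imported more than a handful of posts, a throwaway script beats hand-editing the frontmatter. Here’s a sketch in Python (the key list matches the garbage shown above; extend it for whatever your export left behind):</p>

```python
from pathlib import Path

# Wordpress-import leftovers worth stripping from each post's frontmatter.
GARBAGE_KEYS = ("meta:", "_publicize_pending:", "_edit_last:",
                "_wp_old_slug:", "original_post_id:")

def clean_frontmatter(text):
    """Drop leftover wordpress meta lines from a post's text."""
    return "".join(
        line for line in text.splitlines(keepends=True)
        if not line.lstrip().startswith(GARBAGE_KEYS)
    )

for post in Path("_posts").glob("*.html"):
    post.write_text(clean_frontmatter(post.read_text()))
```

Check the diff before committing; a key list like this is blunt and will happily delete any line that happens to start with one of those prefixes.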

<!-- Atom is telling me to terminate my markdown underline from above, so here:_ -->

<h2 id="the-layouts">The Layout(s)</h2>

<p>The import created a bunch of <code class="language-plaintext highlighter-rouge">_posts</code> which reference a layout called <code class="language-plaintext highlighter-rouge">post</code>. What is that and how do you create it and make it somewhat useful? (I cannot give any advice on making it truly useful.)</p>

<p>You might want to make the post layout so that your posts aren’t rendered plainly, like <a href="/plainpost">this</a>. Here’s a starter, just plop this in <code class="language-plaintext highlighter-rouge">_layouts/post.html</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;!DOCTYPE html&gt;
&lt;html&gt;
  &lt;head&gt;
    &lt;title&gt;
    {% if page.title %} {{ page.title }}
    {% else %} {{ site.title }}
    {% endif %}
    &lt;/title&gt;
  &lt;/head&gt;
  &lt;body&gt;
    &lt;h1&gt;{{ page.title }}&lt;/h1&gt;
    {{ content }}
  &lt;/body&gt;
&lt;/html&gt;
</code></pre></div></div>

<p>That will just make a web page with your post’s title in the title bar of the browser along with a big header up top, then your unadulterated goodness underneath.</p>

<p>Next, you may want to list your posts in a few different places. Here’s how I made the sidebar/bottombar list of posts here. I made a file in <code class="language-plaintext highlighter-rouge">_includes</code> named <code class="language-plaintext highlighter-rouge">sidebar.html</code> and it’s this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Other posts...
&lt;ul&gt;
  {% for post in site.posts %}
  {% if page.url == post.url %}
  &lt;li&gt;&amp;raquo; {{post.title}}&lt;/li&gt;
  {% else %}
  &lt;li&gt;&lt;a href="{{post.url}}"&gt;{{post.title}}&lt;/a&gt;&lt;/li&gt;
  {% endif %}
  {% endfor %}
&lt;/ul&gt;
</code></pre></div></div>

<p>Then you can update your <code class="language-plaintext highlighter-rouge">post.html</code> layout to include this on the right-hand side however you’d like.</p>

<h2 id="seo">SEO</h2>

<p>Ha! I have no idea what I’m talking about here, but generally do these things:</p>

<ol>
  <li>Make a <a href="https://github.com/addumb/addumb.github.io/blob/master/sitemap.xml">sitemap.xml</a>.</li>
  <li>Add descriptions and keywords to your site and to each post. Check the head layout on this site: <a href="https://github.com/addumb/addumb.github.io/blob/master/_includes/head.html">_includes/head.html</a></li>
  <li>Add some jQuery, that helps, right??</li>
  <li>Bootstrap something, is that still fashionable?</li>
</ol>
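<p>For item 1, a sitemap follows the sitemaps.org protocol and can be tiny. A hand-rolled sketch with a single URL (a real one should list every post URL):</p>

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://addumb.com/</loc>
  </url>
</urlset>
```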

<h2 id="walk-away">Walk Away</h2>

<p>That’s it! That’s how I made this thing. It was fun :)</p>]]></content><author><name>Adam Gray</name></author><category term="Tech" /><category term="Tip" /><category term="Tech" /><summary type="html"><![CDATA[Moving a blog from wordpress and website from AWS to GitHub Pages.]]></summary></entry><entry><title type="html">Quick Debian Backporting</title><link href="http://addumb.com/2014/03/10/quick-debian-backporting/" rel="alternate" type="text/html" title="Quick Debian Backporting" /><published>2014-03-10T20:41:22+00:00</published><updated>2014-03-10T20:41:22+00:00</updated><id>http://addumb.com/2014/03/10/quick-debian-backporting</id><content type="html" xml:base="http://addumb.com/2014/03/10/quick-debian-backporting/"><![CDATA[<h2>Backporting</h2>
<p>Suppose you're running Ubuntu and want to get a newer version of a package than what's provided by Ubuntu. The process of re-building a newer version of a package from a newer release of the distro is called "backporting." This can become <em>very dangerous</em> if you start backporting heavily-depended-upon packages like python, so just don't do that. Try to keep to the leaves of your distro's dependency tree.</p>
<h2>Example</h2>
<p>As a specific example, I run Ubuntu 12.04 ("precise"), which comes with <code>nose</code> version 1.1.2-3, but I want something newer so that I can use the <code>--cover-xml</code> option! This option was introduced at <a href="https://github.com/nose-devs/nose/commit/868ce889f1b6cf6423fdd56fbc90058c2f4895d8" target="_blank">https://github.com/nose-devs/nose/commit/868ce889f1b6cf6423fdd56fbc90058c2f4895d8</a> and first released in 1.2.</p>
<p>I want to backport <code>nose &gt;= 1.2</code> to Ubuntu Precise.</p>
<h2>Do This</h2>
<ul>
<li>Go to <a href="http://packages.ubuntu.com/source/precise/nose" target="_blank">http://packages.ubuntu.com/source/precise/nose</a> and click through the different versions of Ubuntu on the top right until a version we want is there.</li>
<li><a href="http://packages.ubuntu.com/source/saucy/nose" target="_blank">Saucy has nose 1.3.0</a>, so I'll use that!</li>
<li>Copy the URL to the .dsc file under the "Download Nose" section: <code>http://archive.ubuntu.com/ubuntu/pool/main/n/nose/nose_1.3.0-2.dsc</code></li>
<li><code>sudo apt-get install devscripts pbuilder</code></li>
<li><code>sudo pbuilder --create --distribution precise</code></li>
<li>Get some coffee, this will take a while.</li>
<li><code>dget http://archive.ubuntu.com/ubuntu/pool/main/n/nose/nose_1.3.0-2.dsc</code></li>
<li><code>sudo pbuilder --build --distribution precise nose_1.3.0-2.dsc</code></li>
</ul>
<p>If all went well, you should now have some .deb files in /var/cache/pbuilder/result!</p>
<p>If you're like me and everything did NOT go well because you tried this in a VM with only 512MB RAM, you probably got some test failures like this:</p>
<pre>
======================================================================
FAIL: Doctest: test_issue270.rst
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3.2/doctest.py", line 2153, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for test_issue270.rst
  File "/tmp/buildd/nose-1.3.0/build/tests/unit_tests/test_issue270.rst", line 0
----------------------------------------------------------------------
File "/tmp/buildd/nose-1.3.0/build/tests/unit_tests/test_issue270.rst", line 17, in test_issue270.rst
Failed example:
    run(argv=argv, plugins=[MultiProcess()])
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib/python3.2/doctest.py", line 1288, in __run
        compileflags, 1), test.globs)
      File "", line 1, in
        run(argv=argv, plugins=[MultiProcess()])
      File "/tmp/buildd/nose-1.3.0/build/tests/nose/plugins/plugintest.py", line 412, in run_buffered
        run(*arg, **kw)
      File "/tmp/buildd/nose-1.3.0/build/tests/nose/plugins/plugintest.py", line 372, in run
        buffer = Buffer()
      File "/tmp/buildd/nose-1.3.0/build/tests/nose/plugins/plugintest.py", line 130, in __init__
        self.__queue = Manager().Queue()
      File "/usr/lib/python3.2/multiprocessing/__init__.py", line 98, in Manager
        m.start()
      File "/usr/lib/python3.2/multiprocessing/managers.py", line 527, in start
        self._process.start()
      File "/usr/lib/python3.2/multiprocessing/process.py", line 132, in start
        self._popen = Popen(self)
      File "/usr/lib/python3.2/multiprocessing/forking.py", line 121, in __init__
        self.pid = os.fork()
    OSError: [Errno 12] Cannot allocate memory
----------------------------------------------------------------------
</pre>
<h2>Warning!</h2>
<p>You should be very careful when installing these .debs, since they are unvalidated backports and may break things unexpectedly. Keeping to dependency tree leaves is one way to mitigate this, since nothing in the distro itself depends on them. <em>It is then up to you</em> to make sure any software you use works nicely with the backported software. This is not limited to just .debs, but could be Python applications, production services, or even just some scripts you whipped up and forgot about until you need them to bring the site back up.</p>]]></content><author><name>Adam Gray</name></author><summary type="html"><![CDATA[Backporting Suppose you're running Ubuntu and want to get a newer version of a package than what's provided by Ubuntu. The process of re-building a newer version of a package from a newer version of the Distro is called "backporting." This can become very dangerous if you start backporting highly-dependent packages like python, so just don't do that. Try to keep to the leaves of your distro's dependency tree. Example As a specific example, I run Ubuntu 12.04 ("precise"), which comes with nose version 1.1.2-3, but I want something newer so that I can use the --cover-xml option! This option was introduced at https://github.com/nose-devs/nose/commit/868ce889f1b6cf6423fdd56fbc90058c2f4895d8 and first released in 1.2. I want to backport nose &gt;= 1.2 to Ubuntu Precise. Do This Go to http://packages.ubuntu.com/source/precise/nose and click through the different versions of Ubuntu on the top right until a version we want is there. Saucy has nose 1.3.0, so I'll use that! Copy the URL to the .dsc file under the "Download Nose" section: http://archive.ubuntu.com/ubuntu/pool/main/n/nose/nose_1.3.0-2.dsc sudo apt-get install devscripts pbuilder sudo pbuilder --create --distribution precise Get some coffee, this will take a while. 
dget http://archive.ubuntu.com/ubuntu/pool/main/n/nose/nose_1.3.0-2.dsc sudo pbuilder --build --distribution precise nose_1.3.0-2.dsc If all went well, you should now have some .deb files in /var/cache/builder/result! If you're like me and everything did NOT go well because you tried this in a VM with only 512MB RAM, you probably got some test failures like this: ====================================================================== FAIL: Doctest: test_issue270.rst ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/lib/python3.2/doctest.py", line 2153, in runTest raise self.failureException(self.format_failure(new.getvalue())) AssertionError: Failed doctest test for test_issue270.rst File "/tmp/buildd/nose-1.3.0/build/tests/unit_tests/test_issue270.rst", line 0 ---------------------------------------------------------------------- File "/tmp/buildd/nose-1.3.0/build/tests/unit_tests/test_issue270.rst", line 17, in test_issue270.rst Failed example: run(argv=argv, plugins=[MultiProcess()]) Exception raised: Traceback (most recent call last): File "/usr/lib/python3.2/doctest.py", line 1288, in __run compileflags, 1), test.globs) File "", line 1, in run(argv=argv, plugins=[MultiProcess()]) File "/tmp/buildd/nose-1.3.0/build/tests/nose/plugins/plugintest.py", line 412, in run_buffered run(*arg, **kw) File "/tmp/buildd/nose-1.3.0/build/tests/nose/plugins/plugintest.py", line 372, in run buffer = Buffer() File "/tmp/buildd/nose-1.3.0/build/tests/nose/plugins/plugintest.py", line 130, in __init__ self.__queue = Manager().Queue() File "/usr/lib/python3.2/multiprocessing/__init__.py", line 98, in Manager m.start() File "/usr/lib/python3.2/multiprocessing/managers.py", line 527, in start self._process.start() File "/usr/lib/python3.2/multiprocessing/process.py", line 132, in start self._popen = Popen(self) File "/usr/lib/python3.2/multiprocessing/forking.py", line 121, in __init__ self.pid = os.fork() OSError: 
[Errno 12] Cannot allocate memory ---------------------------------------------------------------------- Warning! You should be very careful when installing them since they are unvalidated backports and may break things unexpectedly. Keeping to dependency tree leaves is one way to mitigate this, since nothing in the distro itself depends them. It is then up to you to make sure any software you use works nicely with the backported software. This is not limited just .debs, but could be Python applications, production services, or even just some scripts you whipped up and forgot about until you need them to bring the site back up.]]></summary></entry><entry><title type="html">Considering different data systems?</title><link href="http://addumb.com/2013/11/14/considering-different-data-systems/" rel="alternate" type="text/html" title="Considering different data systems?" /><published>2013-11-15T01:13:43+00:00</published><updated>2013-11-15T01:13:43+00:00</updated><id>http://addumb.com/2013/11/14/considering-different-data-systems</id><content type="html" xml:base="http://addumb.com/2013/11/14/considering-different-data-systems/"><![CDATA[<p>Please feel free to agree or disagree. I'm likely downright wrong about a fair amount of this...</p>
<h1>CAP</h1>
<div>The first very high-level consideration is where would you like to sit on the CAP tradeoffs? Consistency across the entire dataset has some pretty stringent requirements. Availability of the dataset was traditionally thought of as a binary concept: is the DB up or down? But in these consumer internet days, it sometimes makes things a lot more relaxed if you allow yourself to have partial availability in your datasets, e.g. users hashing to ^0.* through ^4.* are down, or users hashing to 5 thru 9 are still OK. The partition tolerance part is where things get downright crazy... I'll try to get into some of those more awesomer and recent tradeoffs shortly and poorly.</div>
<div></div>
<div>CA: Highly consistent and available data has no (network) partition tolerance. Another way to say this is you have a single master which you failover when it dies. This is a very familiar situation for traditional DBAs and people who can afford Oracle :)</div>
<div></div>
<div>AP: Highly available and partition-tolerant systems have high consistency costs. A typical setup here is offline processing: you have a place that accepts new writes, but nobody can read them until you copy out the data to <b>all</b> servers. That is strictly all servers, not "each server."</div>
<div></div>
<div>CP: Highly consistent and partition-tolerant systems have low availability, per their trade-off. An example here is actually zookeeper :) If you tried to use zookeeper as a general purpose datastore, you will notice very low "availability." As in your write latencies can approach infinity, which is indistinguishable from the service being down.</div>
<div></div>
<div>People usually end up picking something in-between these 3 wonky states.</div>
<h1>Sharding</h1>
<div>Extremely broadly, "sharding" is the process of dividing your dataset into smaller pieces (shards), each potentially available on a different computer. It's a way to compromise some consistency for partition tolerance. You get the additional benefit of availabilities between "yes" and "no." The typical way to start sharding is to take an incoming piece of data from the system's input (a username, e.g.), run it through a hashing function, and designate ranges of the hashing function's output to be served by different shards.</div>
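<div>That hash-and-bucket step can be sketched in a couple of lines of shell. This is purely illustrative: the choice of md5, 16 shards, and the username are all mine, not from any real system.</div>

```shell
#!/bin/sh
# Hypothetical shard picker: bucket a username into one of 16 shards
# using the first hex digit of its md5 digest. All names and the
# shard count are made up for illustration.
username="alice"
digit=$(printf '%s' "$username" | md5sum | cut -c1)  # first hex char, 0-f
shard=$(printf '%d' "0x$digit")                      # hex digit -> 0..15
echo "user '$username' -> shard $shard"
```

<div>The nice property is that any client can compute the shard locally from the request data itself; no lookup service sits in the hot path.</div>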
<div></div>
<div>One of the biggest under-stated assumptions of sharding is that the key of data you are hashing on is an equally-valued range. If, for example, your company is a ticket-sales company, you probably see each individual ticket sold as roughly as unimportant as the next. A <strong>terrible</strong> key to shard on would be the event the tickets are for. Why would the ticket be good and the event be bad? In a shard failure event (partition or availability gone), you will lose access to a specific percentage of your keys. If you keyed on event and the Superbowl happened to be one of those keys, you are going to lose a LOT of money. If you keyed on ticket, you will lose access to a specific percentage of tickets for all events, but you will <em>retain access</em> to many tickets for the Superbowl! The take-away here is to <b>over-shard</b> so that your unit of failure is small and to <strong>shard on keys of equal business value</strong>.</div>
<div></div>
<div><span style="line-height:1.5;">So now that you know what you wish you could shard on, look back to the product and try to find what I call a naturally occurring key that's close to it. An example: you want to shard on user_id, which is an integer you create, but users keep entering their username! The fix is to hash on the username (which you receive in the POST) to make lookups O(1) instead of O(your user_id lookup stack) as the userbase grows.</span></div>]]></content><author><name>Adam Gray</name></author><summary type="html"><![CDATA[Please feel free to agree or disagree. I'm likely downright wrong about a fair amount of this... CAP The first very high-level consideration is where would you like to sit on the CAP tradeoffs? Consistency across the entire dataset has some pretty stringent requirements. Availability of the dataset was traditionally thought of as a binary concept: is the DB up or down? But in these consumer internet days, it sometimes makes things a lot more relaxed if you allow yourself to have partial availability in your datasets, e.g. users hashing to ^0.* through ^4.* are down, or users hashing to 5 thru 9 are still OK. The partition tolerance part is where things get downright crazy... I'll try to get into some of those more awesomer and recent tradeoffs shortly and poorly. CA: Highly consistent and available data has no (network) partition tolerance. Another way to say this is you have a single master which you failover when it dies. This is a very familiar situation for traditional DBAs and people who can afford Oracle :) AP: Highly available and partition-tolerant systems have high consistency costs. A typical setup here is offline processing: you have a place that accepts new writes, but nobody can read them until you copy out the data to all servers. That is strictly all servers, not "each server." CP: Highly consistent and partition-tolerant systems have low availability, per their trade-off. 
An example here is actually zookeeper :) If you tried to use zookeeper as a general purpose datastore, you will notice very low "availability." As in your write latencies can approach infinity, which is indistinguishable from the service being down. People usually end up picking something in-between these 3 wonky states. Sharding Extremely broadly, "sharding" is the process of dividing your dataset into smaller pieces (shards) each potentially available on a different computer. It's a way to compromize some consistency for partition tolerance. You get the additional benefit of giving you availabilities between "yes" and "no." The typical way to start sharding is to take an incoming piece of data from the system's input (a username, e.g.) and run it through a hashing function and designating ranges in the hashing function output to be served by different shards. One of the biggest under-stated assumptions of sharding is that the key of data you are hashing on is an equally-valued range. If, for example, your company is a ticket-sales company, you probably see each individual ticket sold as roughly as unimportant as the next. A terrible key to shard on would be the event the tickets are for. Why would the ticket be good and the event be bad? In a shard failure event (partition or availability gone), you will lose acess to a specific percentage of your keys. If you keyed on event and the Superbowl happened to be one of those keys, you are going to lose a LOT of money. If you keyed on ticket, you will lose access to a specific percentage of tickets for all events, but you will retain access to many tickets for the Superbowl! The take-away here is to over-shard so that your unit of failure is small and to shard on equal business value data. So now that you know what you wish you could shard on, look back to the product and try to find what I call a natrually ocurring key that's close to it. 
An example is you want to shard on user_id which is an integer you create, but users keep entering their username! The fix is to hash on the username (which you receive in the POST) to make lookups O(1) instead of O(your user_id lookup stack) as the userbase grows.]]></summary></entry><entry><title type="html">I Moved Addumb.com into AWS</title><link href="http://addumb.com/2013/03/18/i-moved-addumb-com-into-aws/" rel="alternate" type="text/html" title="I Moved Addumb.com into AWS" /><published>2013-03-18T15:27:14+00:00</published><updated>2013-03-18T15:27:14+00:00</updated><id>http://addumb.com/2013/03/18/i-moved-addumb-com-into-aws</id><content type="html" xml:base="http://addumb.com/2013/03/18/i-moved-addumb-com-into-aws/"><![CDATA[<p>I decided that I should put my website in the same place I'm supposed to be putting things for work. Every web-oriented SysAdmin (or Systems Engineer, or "DevOp") these days deals with AWS or its competitors (hopefully a bit of a couple providers). I took the plunge yesterday and finally terminated my account with <strong>Yahoo! Web Hosting</strong>. Yes, that's right. I had been kind of embarrased by it, but hey, it worked OK. In fact, my new setup is going to work <em>much much worse</em> in terms of reliability and performance.</p>
<h1>Before</h1>
<p>The setup for addumb.com is really simple. I don't want to have to deal with it daily, weekly, nor even monthly. Here's how I had various services split up while hosted in Yahoo!.</p>
<ul>
<li><span style="line-height:14px;"><strong>DNS</strong>: Registered and hosted through Yahoo!.</span></li>
<li><strong>Email</strong>: MX records pointing to Google Apps for Domains.</li>
<li><strong>HTTP</strong>/web: Yahoo! Web Hosting. A static mini site plus a WordPress "plugin" through Y!WH.</li>
<li>Everything else: hosted on my home desktop with a few port forwards poked through my router.</li>
</ul>
<h1>After</h1>
<p>The new setup for addumb.com is like this:</p>
<ul>
<li><span style="line-height:14px;"><strong>DNS registrar</strong>: namecheap.com because, well, they're cheap.</span></li>
<li><strong>DNS hosting</strong>: Amazon Route 53.</li>
<li><strong>Email</strong>: MX records just like before, pointing to Google Apps for Domains because it remains awesome and is what everybody should do (qualifiers galore).</li>
<li><strong>HTTP</strong>/web: a <strong>single</strong> EC2 instance. This is a <em><strong>terrible idea</strong></em>.</li>
<li>Everything else: a playground in AWS.</li>
</ul>
<h1>The Real Reason</h1>
<p>The real reason I moved my tiny site is so that I can get more flexibility. The "Everything else" in the lists above are really the only reason I even have <em>addumb.com</em>. I spend a fair amount of time learning new application stacks, familiarizing myself with each one's development and deployment patterns. I mostly do this because I'm a nerd and it's <strong>fun.</strong> Also so that I'm not completely lost when Developer X approaches me to deploy Application Y on technology stack Z this afternoon. I need to know what questions to ask, what jargon to use, and how to communicate the needs of performance, reliability and cost.</p>
<p>I'm self-hosting addumb.com in much the same way I would recommend hosting any new web property that has similar needs. My needs are just very different from those of most consumer web properties.</p>
<h1>I'm a Cheapskate</h1>
<p>So why on earth would I put all of addumb.com on a single EC2 instance, let alone one with only ephemeral storage?! Just like any web property, I have certain targets in cost, reliability and performance:<br />
<a href="/assets/tradeoff1.png"><img alt="fast/cheap/available tradeoff" src="/assets/tradeoff1.png" /></a></p>
<p>But my personal web site is very different than a real business. Most people end up in the blue area here, but you can see where I pick in the trade-offs:<br />
<a href="/assets/cheapskate.png"><img alt="cheap sonofabitch" src="/assets/cheapskate.png" /></a></p>
<p>(pardon the <a title="keming" href="http://fuckyeahkeming.com/" target="_blank">kerning</a>) A direct result of me being a cheapskate is that my website will be slower and down more often than most websites you'll see. However, I saved a bundle on my hosting costs! $10/mo versus $20/mo! That's like an extra latte <em>each week!</em></p>
<h1>Damage Control</h1>
<p>To help make this less of a <em><strong>terrible idea</strong></em>, I'm putting the whole site in some git repos and hooking as many things as possible into the post-receive hook in the git repos on the EC2 instance(s). Basically, I force a puppet run for every git push, and try to make that puppet run deal with all the important things, as you can kiiiinda see here.</p>
<pre>#!/bin/bash
#.git/hooks/post-receive example
unset GIT_DIR
cd ..
git reset --hard HEAD
sudo puppet apply --modulepath puppet/modules puppet/init.pp
#puppet/init.pp
node default {
  include nginx
  include flask
  file { ['/var/www','/var/www/html']:
    ensure =&gt; directory,
    owner  =&gt; root,
    group  =&gt; root,
    mode   =&gt; 1766
  }
  exec {'rsync-of-failure':
    command =&gt; '/usr/bin/sudo /usr/bin/rsync -a --delete webroot/* /var/www/html/',
    require =&gt; File['/var/www/html']
  }
}</pre>
<p>Note: See that last exec? Don't ever do that. I'm being sloppy and only hurting myself.</p>
<h1>Don't Try This At Home</h1>
<p>Nobody should jump to any solution for any technical problem without fully understanding (or making a reasonable attempt to understand) the tradeoffs involved and other viable solutions. The increase in flexibility I get from self-hosting (in AWS or some other provider) outweighs the slight increase in cost and decrease in availability and performance for my own website. Mostly because it doesn't matter to anybody but myself. I mean come on, nobody else is even READING this, right?</p>]]></content><author><name>Adam Gray</name></author><summary type="html"><![CDATA[I decided that I should put my website in the same place I'm supposed to be putting things for work. Every web-oriented SysAdmin (or Systems Engineer, or "DevOp") these days deals with AWS or its competitors (hopefully a bit of a couple providers). I took the plunge yesterday and finally terminated my account with Yahoo! Web Hosting. Yes, that's right. I had been kind of embarrased by it, but hey, it worked OK. In fact, my new setup is going to work much much worse in terms of reliability and performance. Before The setup for addumb.com is really simple. I don't want to have to deal with it daily, weekly, nor even monthly. Here's how I had various services split up while hosted in Yahoo!. DNS: Registered and hosted through Yahoo!. Email: MX records pointing to Google Apps for Domains. HTTP/web: Yahoo! Web Hosting. A static mini site plus a Wordpress "plugin" through Y!WH. Everything else: hosted on my home desktop with a few port forwards poked through my router. After The new setup for addumb.com is like this: DNS registrar: namecheap.com because, well, they're cheap. DNS hosting: Amazon Route 53. Email: MX records just like before, pointing to Google Apps for Domains because it remains awesome and is what everybody should do (qualifiers galore). HTTP/web: a single EC2 instance. This is a terrible idea. Everything else: a playground in AWS. 
The Real Reason The real reason I moved my tiny site is so that I can get more flexibility. The "Everything else" in the lists above are really the only reason I even have addumb.com. I spend a fair amount of time learning new application stacks, familiarizing myself with each one's development and deployment patterns. I mostly do this because I'm a nerd and it's fun. Also so that I'm not completely lost when Developer X approaches me to deploy Application Y on technology stack Z this afternoon. I need to know what questions to ask, what jargon to use, and how to communicate the needs of performance, reliability and cost. I'm self-hosting addumb.com sort of similar to how I would recommend hosting any new web property that has similar needs. My needs are just very different than most consumer web properties. I'm a Cheapskate So why on earth would I put all of addumb.com on a single EC2 instance, let alone one with only ephemeral storage?! Just like any web property, I have certain targets in cost, reliability and performance: But my personal web site is very different than a real business. Most people end up in the blue area here, but you can see where I pick in the trade-offs: (pardon the kerning) A direct result of me being a cheapskate is that my website will be slower and down more often than most websites you'll see. However, I saved a bundle on my hosting costs! $10/mo versus $20/mo! That's like an extra latte each week! Damage Control To help make this less of a terrible idea, I'm putting the whole site in some git repos and hooking as many things as possible into the post-receive hook in the git repos on the EC2 instance(s). Basically, I force a puppet run for every git push, and try to make that puppet run deal with all the important things, as you can kiiiinda see here. #!/bin/bash #.git/hooks/post-receive example unset GIT_DIR cd .. 
git reset --hard HEAD sudo puppet apply --modulepath puppet/modules puppet/init.pp #puppet/init.pp node default { include nginx include flask file { ['/var/www','/var/www/html']: ensure =&gt; directory, owner =&gt; root, group =&gt; root, mode =&gt; 1766 } exec {'rsync-of-failure': command =&gt; '/usr/bin/sudo /usr/bin/rsync -a --delete webroot/* /var/www/html/', require =&gt; File['/var/www/html'] } } Note: See that last exec? Don't ever do that. I'm being sloppy and only hurting myself. Don't Try This At Home Nobody should jump to any solution for any technical problem without fully understanding (or making a reasonable attempt to understand) the tradeoffs involved and other viable solutions. The increase in flexibility I get from self-hosting (in AWS or some other provider) outweigh the slight increase in cost and decrease in availability and performance for my own website. Mostly because it doesn't matter to anybody but myself. I mean come on, nobody else is even READING this, right?]]></summary></entry><entry><title type="html">Truth In Distributed Systems</title><link href="http://addumb.com/2012/02/24/truth-in-distributed-systems/" rel="alternate" type="text/html" title="Truth In Distributed Systems" /><published>2012-02-24T22:14:33+00:00</published><updated>2012-02-24T22:14:33+00:00</updated><id>http://addumb.com/2012/02/24/truth-in-distributed-systems</id><content type="html" xml:base="http://addumb.com/2012/02/24/truth-in-distributed-systems/"><![CDATA[<p>I need to write a distributed computer systems post about the concept of "truth."</p>
<p>The gist of it goes like this: Suppose you have a group of servers (A) that need data from another group of servers (B). Each member of A could talk to whichever member of B it likes. Thing is, each member of A could see a member of B fail independently. How should the other members of A know if that member of B failed? The truth of the matter is that a member of B failed... we can get into arguments about "How dead is it?" but suffice it to say you no longer want to consider it a valid member of B.</p>
<p>There are a few ways to prevent other members of A from talking to the failed member of B, that I can see:</p>
<ul>
<li>Send traffic through a bottleneck that determines health of B (a load balancer pair, e.g.)</li>
<li>Have the members of A tell each other about the failure</li>
<li>Don't ;)</li>
</ul>
<p>So... thing is, truth is relative in this situation. We have servers derp1 and derp2, both members of client pool A, and both trying to talk to herp1, a member of B. derp1 and derp2 can independently decide herp1 is dead. The event where derp1 discovers herp1 is dead is separate from the event where derp2 discovers it. It will take a non-zero amount of time for that information to go anywhere. In this situation, the bottleneck or load balancer is going through this same process; it's just been designated Arbiter of Health of Pool B.</p>
<p>It is not an option to have all members of pool A know immediately when herp1 fails, thanks to special relativity. The time it takes to inform all members of A increases with the size of A.</p>
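<p>The "let the clients work it out" end of that spectrum can be sketched in a few lines of shell: a client declares herp1 dead only when a majority of independent probes agree. Everything here is made up for illustration; the <code>probe</code> function is a stand-in for a real health check (say, a TCP connect with a short timeout), and here it pretends herp1 has stopped answering.</p>

```shell
#!/bin/sh
# Hypothetical quorum sketch: three independent health probes of herp1;
# only a majority of failures flips our local view of the "truth."
probe() {
  # stand-in for a real check; pretend herp1 is unreachable
  return 1
}
fails=0
for i in 1 2 3; do
  probe herp1 || fails=$((fails + 1))
done
if [ "$fails" -ge 2 ]; then
  echo "herp1: dead (majority of probes failed)"   # prints this branch
else
  echo "herp1: alive"
fi
```

<p>Tuning the probe count and the waiting period is exactly the trade-off below: more probes and longer waits mean better information, delivered later.</p>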
<p>I prefer to deal with this sort of truth-y information retrieval (e.g. "What are my options for getting data out of B right now?") as a trade-off between immediate single-point knowledge on one side and eventual distributed quorum on the other side. It's mostly a matter of how long you can wait to get "good" information out of pool B. If you need it RIGHT NOW, then you'll need to go through an aggregation point like a load balancer. If you can deal with some "stale" or "bad" responses, then you should relax your requirements and let the clients work it out amongst themselves.</p>]]></content><author><name>Adam Gray</name></author><summary type="html"><![CDATA[I need to write a distributed computer systems post about the concept of "truth." The gist of it goes like this: Suppose you have a group of servers (A) that need data from another group of servers (B). Each member of A could talk to whichever member of B it likes. Thing is, each member of A could see a member of B fail independently. How should the other members of A know if that member of B failed? The truth of the matter is that a member of B failed... we can get into arguments about "How dead is it?" but suffice it to say you no longer want to consider it a valid member of B. There are a few ways to prevent other members of A from talking to the failed member of B, that I can see: Send traffic through a bottleneck that determines health of B (a load balancer pair, e.g.) Have the members of A tell each other about the failure Don't ;) So... thing is truth is relative in this situation. We have servers derp1 and derp2 are members of the client pool A and they are both trying to talk to herp1, a member of B. derp1 and derp2 can independently decide herp1 is dead. The event where derp1 discovers herp1 is dead is separate from the event when derp2 discovers herp1 is dead. It will take a non-zero amount of time for that information to go anywhere. 
In this situation, the bottleneck or load balancer is going through this same process, it's just been deigned Arbiter of Health of Pool B. It is not an option to have all members of pool A know immediately when herp1 fails, thanks to special relativity. The time it takes to inform all members of A increases with the size of A. I prefer to deal with this sort of truth-y information retrieval (e.g. "What are my options for getting data out of B right now?") as a trade-off between immediate single-point knowledge on one side and eventual distributed quorum on the other side. It's mostly a matter of how long you can wait to get "good" information out of pool B. If you need it RIGHT NOW, then you'll need to go through an aggregation point like a load balancer. If you can deal with some "stale" or "bad" responses, then you should relax your requirements and let the clients work it out amongst themselves.]]></summary></entry><entry><title type="html">Updated: MySQL 5.0 and 5.1 Side-By-Side</title><link href="http://addumb.com/2011/03/02/mysql-5-0-and-5-1-side-by-side/" rel="alternate" type="text/html" title="Updated: MySQL 5.0 and 5.1 Side-By-Side" /><published>2011-03-02T18:32:17+00:00</published><updated>2011-03-02T18:32:17+00:00</updated><id>http://addumb.com/2011/03/02/mysql-5-0-and-5-1-side-by-side</id><content type="html" xml:base="http://addumb.com/2011/03/02/mysql-5-0-and-5-1-side-by-side/"><![CDATA[<p>Cross-posted from <a href="http://devblog.meebo.com" target="_blank">http://devblog.meebo.com</a>.</p>
<p>As part of the Meebo Operations team, I manage a bunch of database servers. You may remember Dave's post a couple of months ago about how we're starting to use NoSQL technologies for data storage. Though CouchDB and Cassandra are newer and more exciting, Meebo still relies heavily on MySQL for much of our infrastructure.</p>
<p>56 of the ~160 DB servers I manage are running MySQL, for now. A lot of those are running an old version: 5.0.77. It's old mostly because it's the default provided by our Linux distribution of choice, CentOS. I've been in the process of upgrading to something slightly less ancient (a "non-stock" build of MySQL 5.1 maintained by Percona, available here) for about a year. It's slow going, though, mostly because the servers aren't catching fire on 5.0.77, so it sits lower on my priority list.</p>
<p>A couple months ago, I split one of our highest-traffic databases out onto a pair of servers running MySQL 5.1. Thing is, we run multi-headed MySQL servers: multiple MySQL processes on the same machine. As of this writing, we have 120 MySQL processes running on those 56 machines. This high-traffic app had a remote replica on one of these multi-headed servers, so I had two 5.1 servers trying to replicate to a 5.0 server using an incompatible protocol. How is that possible, you ask? Well, I made a mistake. When I set up the new pair of servers, I spaced and configured them to use a replication protocol that isn't backwards compatible: Row-Based Replication. That meant I had to upgrade part of the multi-headed server to 5.1! Uh oh!</p>
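<p>For reference, the setting I botched lives in my.cnf on the 5.1 masters. A minimal sketch (not our exact config):<br />
<pre>
# Row-based events from a 5.1 master -- 5.0 replicas can't parse these
binlog_format=ROW

# The backwards-compatible choice while 5.0 replicas remain downstream
binlog_format=STATEMENT
</pre></p>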
<p>I had sort of done this in the past, but never in a live production environment. Based on past experience, I knew the steps were something like this:</p>
<p>1) Install the package(s) to a different location:<br />
<pre>
mkdir /opt/mysql51
rpm --relocate /usr/=/opt/mysql51/ -i --nodeps Percona-Server*.rpm
</pre>
Note: you may run into a problem with the /etc/my.cnf file. I recommend moving it somewhere safe so nothing clobbers it.</p>
<p>2) Make sure you specify which mysqladmin and mysqld to use in /etc/my.cnf, otherwise the "default" ones will always be used; in my case, those were the 5.0.77 binaries.</p>
<p>3) This is the one that really killed me: specify the full path to the new language file you want in your configuration file (usually /etc/my.cnf on RedHatty systems)<br />
<pre>
language=/opt/mysql51/share/mysql/english
</pre>
Running two versions of MySQL side-by-side is neat and all, but it's pretty difficult to manage if you can't leverage existing tools like the SysVinit script or mysqld_multi. mysqld_multi is the recommended way to manage multi-headed database servers. It doesn't come with an init script, but it's easy enough to create a "Meh, it kinda works" one based on a Fedora example init script. And mysqld_multi itself isn't hard to set up to manage any version of MySQL, so long as the RPMs were installed to non-conflicting locations. Here's my example /etc/my.cnf file:<br />
<pre>
[mysqld1]
language=/opt/mysql51/share/mysql/english
mysqld=/opt/mysql51/sbin/mysqld
mysqladmin=/opt/mysql51/bin/mysqladmin
datadir=/var/lib/mysql51
socket=/var/run/mysqld/51.sock
pid-file=/var/run/mysqld/51.pid
log-error=/var/log/mysqld-51.log
port=3306

[mysqld2]
language=/usr/share/mysql/english
mysqld=/usr/libexec/mysqld
mysqladmin=/usr/bin/mysqladmin
datadir=/var/lib/mysql50
socket=/var/run/mysqld/50.sock
pid-file=/var/run/mysqld/50.pid
log-error=/var/log/mysqld-50.log
port=3307

[mysqld_multi]
log=/var/log/mysqld.log

[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
</pre>
Note that I'm using the mysqld_multi format, not the format for a regular single-process setup. Also note that these parameters are the absolute minimum to get mysqld_multi to start the processes; you still need to tune MySQL the rest of the way. That typically ends up being many duplicated lines in /etc/my.cnf specifying the InnoDB buffer pool size, the thread concurrency limits of infinity, etc. Also be careful not to use non-backward-compatible configuration parameters in the 5.0.x section!</p>
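<p>For instance, the per-instance tuning ends up looking something like this (sizes are purely illustrative, not our real values):<br />
<pre>
[mysqld1]
innodb_buffer_pool_size=8G
innodb_log_file_size=256M

[mysqld2]
innodb_buffer_pool_size=8G
innodb_log_file_size=256M
</pre></p>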
<p>In order to get the processes to start, you need to run mysql_install_db for each datadir, and you need to use the matching version for each, so watch out! Here's how I did that:</p>
<p>1) For the 5.0 version, since that's the package installed to the default location, I can just run this: <pre>mysql_install_db --datadir=/var/lib/mysql50 --basedir=/usr</pre></p>
<p>2) For the 5.1 version, I need to specify the other path: <pre>/opt/mysql51/bin/mysql_install_db --datadir=/var/lib/mysql51 --basedir=/opt/mysql51</pre></p>
<p>3) Finally, start the processes: <pre>mysqld_multi start</pre></p>
<p>Once the processes start happily, I can start, stop, and check the status of the MySQL processes through mysqld_multi, regardless of version.</p>
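<p>Day-to-day management then looks something like this (the numbers refer to the [mysqldN] groups above):<br />
<pre>
mysqld_multi report       # status of all instances
mysqld_multi stop 2       # stop just the 5.0 instance ([mysqld2])
mysqld_multi start 1,2    # start both
</pre></p>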
<p>There you are! You can now run any version of MySQL you like side-by-side with existing legacy processes. I did this out of necessity, but it made me realize how much it can help with my ongoing upgrade to 5.1. It gives you the option to do the MySQL 5.0 to 5.1 upgrade almost in-place, so long as your server can handle the extra load of a command like this: <pre>mysqldump -S /var/run/mysqld/50.sock --all-databases | mysql -S /var/run/mysqld/51.sock</pre> I could even write a migration daemon that goes around our MySQL servers to upgrade them all... if I didn't know what was good for me.</p>
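<p>To sanity-check the copy afterwards, something along these lines works (a sketch; adjust sockets and credentials to taste):<br />
<pre>
mysql -S /var/run/mysqld/50.sock -e 'SHOW DATABASES' | sort > /tmp/dbs-50
mysql -S /var/run/mysqld/51.sock -e 'SHOW DATABASES' | sort > /tmp/dbs-51
diff /tmp/dbs-50 /tmp/dbs-51 && echo "same databases on both"
</pre></p>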
<p>-Adam</p>]]></content><author><name>Adam Gray</name></author><category term="Tech" /><category term="Tip" /><category term="MySQL" /><category term="Tech" /><category term="tip" /><summary type="html"><![CDATA[Cross-posted from http://devblog.meebo.com. As part of working on the Meebo Operations team, I manage a bunch of database servers at Meebo. You may remember Dave's post a couple months ago about how we're starting to use NoSQL technologies for data storage. Though CouchDB and Cassandra are more new and exciting, Meebo still relies heavily upon MySQL for much of our infrastructure. 56 of the ~160 DB servers I manage are running MySQL, for now. A lot of those are running an old version of MySQL: 5.0.77. The version is old mostly because it's the default provided by our Linux distribution of choice, CentOS. I’ve been in the process of upgrading to something slightly less ancient (a "non-stock" version of MySQL 5.1 built and maintained by Percona, available here) for about a year. It’s slow-going, though, mostly because the servers aren't catching fire by staying on 5.0.77, so it’s lower in my priorities. A couple months ago, I split out one of our highest traffic databases onto a pair of servers running MySQL 5.1. Thing is, I set up multi-headed MySQL servers running multiple MySQL processes on the same machine. As of this writing, we have 120 MySQL processes running on the 56 machines. This high-traffic app had a remote replica on one of these multi-headed servers. So, I had two 5.1 servers trying to replicate to a 5.0 server using an incompatible protocol. How is that possible, you ask? Well I made a mistake. When I set up the new pair of servers, I spaced and made them use a non-backwards-compatible replication protocol, Row-Based Replication. That meant I had to upgrade part of the multi-headed server to 5.1! Uh oh! I had sort of done this in the past, but never in a live, production environment. 
Based on past experience, I knew the steps were something like this: 1) Install the package(s) to a different location: mkdir /opt/mysql51 rpm --relocate /usr/=/opt/mysql51/ -i --nodeps Percona-Server*.rpm Note: you may run into a problem with the /etc/my.cnf file. I recommend moving it somewhere safe so nothing clobbers it. 2) Make sure you specify which mysqladmin and mysqld to use in /etc/my.cnf, otherwise the "default" ones will always be used, in my case those were the 5.0.77 binaries. 3) This is the one that really killed me: specify the full path to the new language file you want in your configuration file (usually /etc/my.cnf on RedHatty systems) language=/opt/mysql51/share/mysql/english Running two versions of MySQL side-by-side is neat and all, but it's pretty difficult to manage if you can't leverage existing tools like the SysVinit script, or even mysqld_multi. mysqld_multi is the recommended way to manage multi-headed database servers. It doesn't come with an init script, but it's pretty easy to create a "Meh, it kinda works" one based on a Fedora example init script. As for mysqld_multi, it's really not that difficult to set it up to manage any version of MySQL, so long as the RPMs were installed to non-conflicting locations. Here's my example /etc/my.cnf file: [mysqld1] language=/opt/mysql51/share/mysql/english mysqld=/opt/mysql51/sbin/mysqld mysqladmin=/opt/mysql51/bin/mysqladmin datadir=/var/lib/mysql51 socket=/var/run/mysqld/51.sock pid-file=/var/run/mysqld/51.pid log-error=/var/log/mysqld-51.log port=3306 [mysqld2] language=/usr/share/mysql/english mysqld=/usr/libexec/mysqld mysqladmin=/usr/bin/mysqladmin datadir=/var/lib/mysql50 socket=/var/run/mysqld/50.sock pid-file=/var/run/mysqld/50.pid log-error=/var/log/mysqld-50.log port=3307 [mysqld_multi] log=/var/log/mysqld.log [mysqld_safe] log-error=/var/log/mysqld.log pid-file=/var/run/mysqld/mysqld.pid Note that I'm using the format for mysqld_multi, not for a regular single-process setup. 
Also note that these parameters are the absolute minimum to get mysqld_multi to start the processes. It is always required to tune MySQL the rest of the way. This typically ends up being many duplicated lines in /etc/my.cnf specifying the InnoDB buffer pool size, the thread concurrency limits of infinity, etc. Also be careful not to use non-backward-compatible configuration parameters in the 5.0.x section! In order to get the processes to start, you need to run mysql_install_db, but you need to use different versions, so watch out! Here's how I did that: 1) For the 5.0 version, since that's the package installed to the default location, I can just run this: mysql_install_db --datadir=/var/lib/mysql50 --basedir=/usr 2) For the 5.1 version, I need to specify the other path: /opt/mysql51/bin/mysql_install_db --datadir=/var/lib/mysql51 --basedir=/opt/mysql51 3) Finally, start the processes: mysqld_multi start Once the processes start happily, I can then start, stop and check the status of the MySQL processes, regardless of version. There you are! You can now run any version of MySQL you like side-by-side with existing legacy processes. I did this out of necessity, but realized how much it can help with my ongoing upgrade to 5.1. This gives you the option to do the MySQL 5.0 to 5.1 upgrade almost in-place, so long as your server can handle the extra load of a command like this: mysqldump -S /var/run/mysqld/5.0.sock --all-databases | mysql -S /var/run/mysqld/5.1.sock I could even write a migration daemon that goes around our MySQL servers to upgrade them all... if I didn't know what was good for me. -Adam]]></summary></entry></feed>