Progress update – parser, strings and containers

Dear lord I suck at making visuals…

All this for documentation no one will read…

Instead of using an already existing code documentation generator, like Doxygen, I have decided yet again to build my own. For now I want to make sure it parses my C code (and my dear cmake macros too), then generates a nice set of HTML pages that I can browse through and marvel at…

I had to tremendously rewrite the string code, and the container code, to make little progress on the parsing side.

Strings

Manipulating strings is going to be a huge part of the code documentation generator. In the C++ I had my own string classes, which worked fine I guess (it was a crappy version of std::string).

That damn OOP

As a programmer corrupted by OOP, I was inclined to think, just make a string object, that you can use in any context. When dealing with strings rarely I guess that doesn’t matter much, if you’re constantly calling strlen, or allocating / copying memory. So underneath the string object, it can use fixed memory, allocated memory, regular C strings, who cares?

WRONG! For a parser / text generator, EVERYTHING you do is just read, write, copy, trim, search strings. So the performance gain is huge if you’re not calling strlen a million times for nothing, only allocate memory once to read the file, and use fixed arrays to store the text.

With that in mind I’ve identified the different ways I use strings, so now that code consists of functions for different types of strings.

  • the regular old C string, a const char* that is (supposed to be) null-terminated
  • string blocks, a small struct that contains the pointer to characters and a length (useful for taking only parts of an existing C string)
  • my own dynamic strings, when searching for ideas I stumbled upon sds, I really like the reasoning behind it, and how it looked so I had a go at my own implementation and it works pretty well
  • a fixed version of the dynamic string, works mostly the same, except for functions that require it to grow, but it can be used in a struct or on the stack and doesn’t require memory allocation
////////////////// Regular C strings function examples ////////////////////////

bool c_string_equal(const char *a, const char *b); /* test if two c strings are equal */
uint32 c_string_find(const char *text, char character); /* return character index in string */
bool c_string_is_numerical(const char *text); /* returns true if string has only numerical characters */
void c_string_replace(char *text, char toReplace, char replacement); /* replace character in string */

////////////////// String blocks + example functions ////////////////////////

/* block string, designed to speed up text functions when length is already known */
typedef struct {
	size_t length;
	char  *characters;
} bstr;

bstr c_string_sub(const char *text, uint32 start, uint32 count); /* return sub-part of string*/
bstr c_string_get_word(const char *text, uint32 index); /* return word from string */
bstr c_string_split(const char *text, char splitChar, uint32 index); /* split string with splitChar */
bool c_bstr_equal(const bstr *a, const bstr *b); /* test if two string blocks are equal */
bool c_bstr_equal_string(const bstr *text, const char *str); /* test if string block is equal to string */
bstr c_bstr_get_word(const bstr *text, uint32 index); /* return word from string block */
bool c_bstr_starts_with_bstr(const bstr *text, const bstr *toFind); /* return true if string block starts with text */

//////////////// Dynamic string + example functions /////////////////

/* header for cstr type, saves the current length and capacity of the string */
typedef struct {
	size_t length;
	size_t capacity;
} cstr_header;

/* cstr - C string pointer preceded by header
 * memory layout: cstr_header | data | terminator
 */
typedef char *cstr;

cstr c_cstr_alloc(size_t size); /* allocate a dynamic string */
void c_cstr_free(cstr str); /* free a previously allocated dynamic string */
cstr c_cstr_resize(cstr str, size_t size); /* resize a dynamic string */
cstr c_cstr_grow(cstr str, size_t n); /* grow cstr if needed */

cstr c_cstr_append_string(cstr str, const char *text); /* append string to cstr */

bool c_cstr_equal(const cstr a, const cstr b); /* test if two cstr are equal */
bool c_cstr_equal_string(const cstr text, const char *str); /* test if cstr is equal to string */

bool c_cstr_is_numerical(const cstr text); /* returns true if cstr has only numerical characters */

////////////////////// Fixed string + example functions ////////////

/* a fixed string is a cstr created on the stack */
typedef cstr fstr;

// // define all fstr to cstr func
#define c_fstr_to_bstr  c_cstr_to_bstr /* get string block from fstr */
#define c_fstr_length   c_cstr_length  /* return length of fstr */
#define c_fstr_capacity c_cstr_capacity /* return capacity of fstr */

void c_fstr_append_string(fstr str, const char *text); /* append string to fstr */

#define c_fstr_trim c_cstr_trim /* trim character (at start & end)*/

There are a lot more functions than I showed here, (and I’ll probably add some more), I will hopefully be able to show them all once my code documentation tool is ready.

Containers

The neat little technique I am using for the dynamic string is pretty much the same idea behind Sean Barrett’s stretchy buffer. Sean’s version was just much more elegant…
Now I have two base containers, a linked list (I don’t know if I will ever use that), and a dynamic array, based on rewriting from scratch Sean’s stb, to which I added much needed insert and remove functions.

Now just with some carefully chosen define statement to mimic functions, that dynamic array can be used as

  • vector
  • stack
  • queue

I have also made a fixed version of those containers, based on a fixed array (which works the same way as the fixed strings)

I started on adding some useful array algorithms, like reverse, some basic binary search and a sort. The sort is the horrible naive one so I’ll look into adding quick sort soon.

Below is a look at the current containers header

#ifndef C_CONTAINERS_H
#define C_CONTAINERS_H

#include "c_dynamic_array.h"
#include "c_fixed_array.h"
#include "c_list.h"
#include "c_allocation.h"

/* dynamic containers */
#define c_vector_push   CORE_DYNAMIC_ARRAY_PUSH
#define c_vector_pop    CORE_DYNAMIC_ARRAY_POP
#define c_vector_insert CORE_DYNAMIC_ARRAY_INSERT
#define c_vector_add_at CORE_DYNAMIC_ARRAY_ADD_AT
#define c_vector_remove CORE_DYNAMIC_ARRAY_REMOVE
#define c_vector_clear  CORE_DYNAMIC_ARRAY_FREE
#define c_vector_count  CORE_DYNAMIC_ARRAY_COUNT
#define c_vector_add    CORE_DYNAMIC_ARRAY_ADD

#define c_stack_add(s)   CORE_DYNAMIC_ARRAY_ADD(s, 1)
#define c_stack_push     CORE_DYNAMIC_ARRAY_PUSH
#define c_stack_pop      CORE_DYNAMIC_ARRAY_POP
#define c_stack_empty(s) (CORE_DYNAMIC_ARRAY_COUNT(s) == 0)
#define c_stack_clear	 CORE_DYNAMIC_ARRAY_FREE

#define c_queue_enqueue    CORE_DYNAMIC_ARRAY_PUSH
#define c_queue_add(q)     CORE_DYNAMIC_ARRAY_ADD(q, 1)
#define c_queue_dequeue(q) CORE_DYNAMIC_ARRAY_REMOVE(q, 0)
#define c_queue_empty(q)   (CORE_DYNAMIC_ARRAY_COUNT(q) == 0)
#define c_queue_clear	   CORE_DYNAMIC_ARRAY_FREE

/* fixed containers */
#define c_fvector        CORE_FIXED_ARRAY
#define c_fvector_push   CORE_FIXED_ARRAY_PUSH
#define c_fvector_pop    CORE_FIXED_ARRAY_POP
#define c_fvector_insert CORE_FIXED_ARRAY_INSERT
#define c_fvector_add_at CORE_FIXED_ARRAY_ADD_AT
#define c_fvector_remove CORE_FIXED_ARRAY_REMOVE
#define c_fvector_clear  CORE_FIXED_ARRAY_FREE
#define c_fvector_count  CORE_FIXED_ARRAY_COUNT
#define c_fvector_add    CORE_FIXED_ARRAY_ADD

#define c_fstack          CORE_FIXED_ARRAY
#define c_fstack_add(s)   CORE_FIXED_ARRAY_ADD(s, 1)
#define c_fstack_push     CORE_FIXED_ARRAY_PUSH
#define c_fstack_pop      CORE_FIXED_ARRAY_POP
#define c_fstack_empty(s) (CORE_FIXED_ARRAY_COUNT(s) == 0)
#define c_fstack_clear    CORE_FIXED_ARRAY_FREE

#define c_fqueue            CORE_FIXED_ARRAY
#define c_fqueue_enqueue    CORE_FIXED_ARRAY_PUSH
#define c_fqueue_add(q)     CORE_FIXED_ARRAY_ADD(q, 1)
#define c_fqueue_dequeue(q) CORE_FIXED_ARRAY_REMOVE(q, 0)
#define c_fqueue_empty(q)   (CORE_FIXED_ARRAY_COUNT(q) == 0)
#define c_fqueue_clear	    CORE_FIXED_ARRAY_FREE

#endif // C_CONTAINERS_H

I have to get cracking on making a proper map, and hashset / hashmap too.

Of course, I made numerous bugs when writing this much code.

One that was particularly horrendous, I managed to generate heap corruption, and make a simple malloc call fail!

It was actually in the dynamic string growing function, because of a wrong comparison between length and capacity, it would not reallocate the memory, but write over it as if it was bigger… That wasn’t a very fun bug hunt.

I plan to have two recurring tasks, document the code properly and add thorough unit tests. Especially for the core library and anything playing around with memory.

Code Parser

I am glad to use all those different string and container functions somewhere.

Parsing C is a bundle of joy! (No, it’s really not…) I could have just whipped up a parsing functions that just reads the C code and spit out HTML, but I figured I would have use of parsing for more than just documentation, like introspection and code generation. So I decided to approach it in layers.

Layer one – All text in a C file can be split up into:

  • Comments – multi line OR single line…
  • Preprocessor – easy except for #defines (urgh)
  • Code – the rest

Now that I have all those different blocks (and saving the line number), it makes it easier to parse the different bits.

Layer Two – Examine each layer_one block to differentiate:

  • Top-level comments with special tags (like @section, @file, …)
  • Macros or preprocessor constants
  • Functions, structs, enums etc (still working on that part…)

I’m thinking of adding a third layer specifically for structs and unions. If I have the courage I’ll add features to parse C++ code too.

Parsing Cmake files for macros was much easier (what a surprise!). Not sure I should need anything else from those files.

I started making HTML generation functions, fortunately it’s not hard, just incredibly boring. The other not-so-fun part will come with generating CSS to make it look less hideous nice and colorful.

Adjusting colors, fonts, border widths, etc. is something I don’t enjoy one bit. Thankfully I’ll have some help with that, my girlfriend will make me cool examples of HTML/CSS code doc pages, that I can just rip off and integrate into the HTML generator!

Praying the OOP away

In conclusion, I have learned a valuable lesson, How not to consider engineeri…. STOP YOU FOOL, Write some damn unit tests for memory handling code!

Hum.. I have learned TWO valuable lessons. The first is to as much as possible write unit tests for sensitive code, the second is how to not consider code engineering problems as concepts to abstract into oblivion, but carefully consider what is it you’re doing and how best to improve it.

Thanks for reading, stay tuned!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: