When working with text data in C#, parsing and tokenizing strings efficiently is crucial for applications such as compilers, data processing, and natural language processing. In this tutorial, we will build a solid understanding of how to create fast and efficient string parsers and tokenizers using C#.
By the end of this tutorial, you will have built a robust tokenizer from scratch, optimized it for performance, and understood key principles behind efficient parsing.
1. Understanding the Basics of Parsing and Tokenization
Before diving into the implementation, let’s clarify some core concepts:
- Parsing is the process of analyzing a sequence of symbols (text, code, or data) to determine its structure.
- Tokenization is the process of breaking a string into smaller units called tokens. These tokens can represent words, numbers, punctuation marks, or any meaningful elements.
For example, given the input string:
var x = 10 + 20;
A tokenizer might break it into the following tokens:
["var", "x", "=", "10", "+", "20", ";"]
Each token represents a meaningful unit that can be further processed in a parser.
2. Choosing the Right Approach for String Parsing
There are multiple ways to parse and tokenize strings in C#:
- Using string.Split – Quick and simple but not ideal for complex parsing.
- Using Regular Expressions (Regex) – Suitable for pattern matching but can be slow for large-scale text processing.
- Manual Character-Based Parsing – Best for performance and flexibility.
In this tutorial, we’ll focus on manual character-based parsing, which is efficient and allows for custom behavior.
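For comparison, here is roughly what the first two approaches look like. This is only an illustrative sketch: the split separator and the regex pattern are examples, not a drop-in tokenizer.
using System;
using System.Text.RegularExpressions;

// string.Split: quick to write, but whitespace-only splitting loses punctuation structure.
string[] parts = "var x = 10 + 20;".Split(' ', StringSplitOptions.RemoveEmptyEntries);
// ["var", "x", "=", "10", "+", "20;"]  -- note how ";" stays glued to "20"

// Regex: more precise, but each match carries pattern-matching overhead.
var matches = Regex.Matches("var x = 10 + 20;", @"\w+|[^\s\w]");
foreach (Match m in matches)
    Console.WriteLine(m.Value); // var, x, =, 10, +, 20, ;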
3. Building a Basic Tokenizer from Scratch
Let’s start by writing a simple tokenizer that breaks a string into words, numbers, and symbols.
Step 1: Define the Token Class
We first define a Token class to store token information.
public enum TokenType
{
    Identifier, // Variable names, keywords
    Number,     // Numeric values
    Operator,   // +, -, *, /
    Symbol,     // Punctuation symbols
    Whitespace, // Spaces, tabs
    Unknown     // Anything else
}

public class Token
{
    public TokenType Type { get; }
    public string Value { get; }

    public Token(TokenType type, string value)
    {
        Type = type;
        Value = value;
    }

    public override string ToString() => $"{Type}: {Value}";
}
Step 2: Implement the Tokenizer
Now, we create a Tokenizer class that processes an input string and extracts tokens.
using System;
using System.Collections.Generic;
using System.Text;

public class Tokenizer
{
    private readonly string _input;
    private int _position;

    public Tokenizer(string input)
    {
        _input = input;
        _position = 0;
    }

    private char CurrentChar => _position < _input.Length ? _input[_position] : '\0';

    private void Advance() => _position++;

    public List<Token> Tokenize()
    {
        List<Token> tokens = new List<Token>();

        while (_position < _input.Length)
        {
            char c = CurrentChar;

            if (char.IsWhiteSpace(c))
            {
                tokens.Add(new Token(TokenType.Whitespace, c.ToString()));
                Advance();
            }
            else if (char.IsLetter(c))
            {
                tokens.Add(ReadIdentifier());
            }
            else if (char.IsDigit(c))
            {
                tokens.Add(ReadNumber());
            }
            else if ("+-*/=".Contains(c))
            {
                tokens.Add(new Token(TokenType.Operator, c.ToString()));
                Advance();
            }
            else if (";,(){}".Contains(c)) // Punctuation symbols
            {
                tokens.Add(new Token(TokenType.Symbol, c.ToString()));
                Advance();
            }
            else
            {
                tokens.Add(new Token(TokenType.Unknown, c.ToString()));
                Advance();
            }
        }

        return tokens;
    }

    private Token ReadIdentifier()
    {
        StringBuilder sb = new StringBuilder();
        while (_position < _input.Length && char.IsLetterOrDigit(CurrentChar))
        {
            sb.Append(CurrentChar);
            Advance();
        }
        return new Token(TokenType.Identifier, sb.ToString());
    }

    private Token ReadNumber()
    {
        StringBuilder sb = new StringBuilder();
        while (_position < _input.Length && char.IsDigit(CurrentChar))
        {
            sb.Append(CurrentChar);
            Advance();
        }
        return new Token(TokenType.Number, sb.ToString());
    }
}
4. Testing the Tokenizer
We can now test our tokenizer with a simple Main method.
using System;

class Program
{
    static void Main()
    {
        string input = "var x = 42 + 8;";
        Tokenizer tokenizer = new Tokenizer(input);
        var tokens = tokenizer.Tokenize();

        foreach (var token in tokens)
        {
            Console.WriteLine(token);
        }
    }
}
Expected Output
Identifier: var
Whitespace:
Identifier: x
Whitespace:
Operator: =
Whitespace:
Number: 42
Whitespace:
Operator: +
Whitespace:
Number: 8
Symbol: ;
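In many scenarios the whitespace tokens are just noise for later parsing stages. One possible post-processing step (a usage sketch continuing from the Main method above, not part of the tokenizer itself) is to filter them out with LINQ:
using System.Linq;

// Hypothetical refinement: drop whitespace tokens before handing the list to a parser.
var significantTokens = tokens.Where(t => t.Type != TokenType.Whitespace).ToList();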
5. Optimizing the Tokenizer for Performance
While our tokenizer works well, we can improve its efficiency.
Optimization 1: Using ReadOnlySpan<char> Instead of Substrings
Instead of building up each token with a StringBuilder, we can slice a ReadOnlySpan<char> over the input to process text with fewer allocations.
using System;

public ref struct TokenizerSpan
{
    private ReadOnlySpan<char> _span;
    private int _position;

    public TokenizerSpan(ReadOnlySpan<char> span)
    {
        _span = span;
        _position = 0;
    }

    public Token? NextToken()
    {
        if (_position >= _span.Length) return null;

        char c = _span[_position];

        if (char.IsWhiteSpace(c))
        {
            _position++;
            return new Token(TokenType.Whitespace, c.ToString());
        }

        if (char.IsLetter(c))
        {
            int start = _position;
            while (_position < _span.Length && char.IsLetterOrDigit(_span[_position]))
                _position++;
            return new Token(TokenType.Identifier, _span[start.._position].ToString());
        }

        if (char.IsDigit(c))
        {
            int start = _position;
            while (_position < _span.Length && char.IsDigit(_span[_position]))
                _position++;
            return new Token(TokenType.Number, _span[start.._position].ToString());
        }

        if ("+-*/=".Contains(c))
        {
            _position++;
            return new Token(TokenType.Operator, c.ToString());
        }

        if (";,(){}".Contains(c)) // Punctuation symbols, consistent with the Tokenizer class
        {
            _position++;
            return new Token(TokenType.Symbol, c.ToString());
        }

        _position++;
        return new Token(TokenType.Unknown, c.ToString());
    }
}
This version avoids the intermediate StringBuilder work and per-character copies: each token is read by slicing the span, and the only allocation per token is the final string produced by ToString(), which reduces garbage-collection pressure on large inputs.
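Usage differs from the list-based Tokenizer: you pull tokens one at a time. A minimal sketch, assuming the Token and TokenType definitions from earlier and run inside a Main method or top-level statements:
// Hypothetical usage of TokenizerSpan.
string input = "var x = 42 + 8;";
var tokenizer = new TokenizerSpan(input.AsSpan());

Token? token;
while ((token = tokenizer.NextToken()) != null)
{
    Console.WriteLine(token); // e.g. "Identifier: var"
}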
6. Expanding the Tokenizer: Handling Strings and Comments
To support more advanced parsing, we can extend our tokenizer to recognize:
- String literals: "Hello World"
- Comments: // This is a comment
Extending the Tokenizer
First, add String and Comment members to the TokenType enum so these tokens are not mislabeled as identifiers, then add the following methods to the Tokenizer class:
private Token ReadString()
{
    StringBuilder sb = new StringBuilder();
    Advance(); // Skip opening quote
    while (_position < _input.Length && CurrentChar != '"')
    {
        sb.Append(CurrentChar);
        Advance();
    }
    Advance(); // Skip closing quote
    return new Token(TokenType.String, sb.ToString());
}

private Token ReadComment()
{
    StringBuilder sb = new StringBuilder();
    while (_position < _input.Length && CurrentChar != '\n')
    {
        sb.Append(CurrentChar);
        Advance();
    }
    return new Token(TokenType.Comment, sb.ToString());
}
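These methods only take effect once the main loop dispatches to them. A minimal sketch of the extra branches, assuming the extended TokenType from above:
// Inside the while loop of Tokenize(), before the operator/symbol checks:
else if (c == '"')
{
    tokens.Add(ReadString());
}
else if (c == '/' && _position + 1 < _input.Length && _input[_position + 1] == '/')
{
    tokens.Add(ReadComment());
}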
7. Conclusion
In this tutorial, we explored how to build an efficient tokenizer in C#. We started with a basic implementation and optimized it using ReadOnlySpan<char>. We also extended it to support string literals and comments.
By following this approach, you can build highly efficient text parsers for different use cases like programming language interpreters, data processing pipelines, or log file analyzers.
What’s next? Try enhancing this tokenizer by adding support for floating-point numbers, multi-character operators (like >= or !=), and function call parsing!
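As a starting point for the first two exercises, here is one possible shape; it is only a sketch, not the only design. The first snippet would replace ReadNumber in the Tokenizer class, the second would be an extra branch in the Tokenize loop, placed before the single-character operator check.
// Sketch: extend ReadNumber to accept a single decimal point (floating-point literals).
private Token ReadNumber()
{
    StringBuilder sb = new StringBuilder();
    bool seenDot = false;
    while (_position < _input.Length &&
           (char.IsDigit(CurrentChar) || (CurrentChar == '.' && !seenDot)))
    {
        if (CurrentChar == '.') seenDot = true;
        sb.Append(CurrentChar);
        Advance();
    }
    return new Token(TokenType.Number, sb.ToString());
}

// Sketch: recognize two-character operators by peeking one character ahead.
// Place this branch before the single-character operator branch in Tokenize().
else if ((c == '>' || c == '<' || c == '!' || c == '=') &&
         _position + 1 < _input.Length && _input[_position + 1] == '=')
{
    tokens.Add(new Token(TokenType.Operator, _input.Substring(_position, 2)));
    Advance();
    Advance();
}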