When working with text data in C#, parsing and tokenizing strings efficiently is crucial for applications such as compilers, data processing, and natural language processing. In this tutorial, we will build a solid understanding of how to create fast and efficient string parsers and tokenizers using C#.
By the end of this tutorial, you will have built a robust tokenizer from scratch, optimized it for performance, and understood key principles behind efficient parsing.
1. Understanding the Basics of Parsing and Tokenization
Before diving into the implementation, let’s clarify some core concepts:
- Parsing is the process of analyzing a sequence of symbols (text, code, or data) to determine its structure.
- Tokenization is the process of breaking a string into smaller units called tokens. These tokens can represent words, numbers, punctuation marks, or any meaningful elements.
For example, given the input string:
var x = 10 + 20;
A tokenizer might break it into the following tokens:
["var", "x", "=", "10", "+", "20", ";"]
Each token represents a meaningful unit that can be further processed in a parser.
2. Choosing the Right Approach for String Parsing
There are multiple ways to parse and tokenize strings in C#:
- Using string.Split – Quick and simple but not ideal for complex parsing.
- Using Regular Expressions (Regex) – Suitable for pattern matching but can be slow for large-scale text processing.
- Manual Character-Based Parsing – Best for performance and flexibility.
In this tutorial, we’ll focus on manual character-based parsing, which is efficient and allows for custom behavior.
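For comparison, here is roughly what the first two approaches look like. This is only an illustrative sketch: the split separator and the regex pattern are examples, not a drop-in tokenizer.
using System;
using System.Text.RegularExpressions;

// string.Split: quick to write, but whitespace-only splitting loses punctuation structure.
string[] parts = "var x = 10 + 20;".Split(' ', StringSplitOptions.RemoveEmptyEntries);
// ["var", "x", "=", "10", "+", "20;"]  -- note how ";" stays glued to "20"

// Regex: more precise, but each match carries pattern-matching overhead.
var matches = Regex.Matches("var x = 10 + 20;", @"\w+|[^\s\w]");
foreach (Match m in matches)
    Console.WriteLine(m.Value); // var, x, =, 10, +, 20, ;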
3. Building a Basic Tokenizer from Scratch
Let’s start by writing a simple tokenizer that breaks a string into words, numbers, and symbols.
Step 1: Define the Token Class
We first define a Token class to store token information.
public enum TokenType
{
    Identifier, // Variable names, keywords
    Number,     // Numeric values
    Operator,   // +, -, *, /
    Symbol,     // Punctuation symbols
    Whitespace, // Spaces, tabs
    Unknown     // Anything else
}

public class Token
{
    public TokenType Type { get; }
    public string Value { get; }

    public Token(TokenType type, string value)
    {
        Type = type;
        Value = value;
    }

    public override string ToString() => $"{Type}: {Value}";
}
Step 2: Implement the Tokenizer
Now, we create a Tokenizer class that processes an input string and extracts tokens.
using System;
using System.Collections.Generic;
using System.Text;

public class Tokenizer
{
    private readonly string _input;
    private int _position;

    public Tokenizer(string input)
    {
        _input = input;
        _position = 0;
    }

    private char CurrentChar => _position < _input.Length ? _input[_position] : '\0';

    private void Advance() => _position++;

    public List<Token> Tokenize()
    {
        List<Token> tokens = new List<Token>();

        while (_position < _input.Length)
        {
            char c = CurrentChar;

            if (char.IsWhiteSpace(c))
            {
                tokens.Add(new Token(TokenType.Whitespace, c.ToString()));
                Advance();
            }
            else if (char.IsLetter(c))
            {
                tokens.Add(ReadIdentifier());
            }
            else if (char.IsDigit(c))
            {
                tokens.Add(ReadNumber());
            }
            else if ("+-*/=".Contains(c))
            {
                tokens.Add(new Token(TokenType.Operator, c.ToString()));
                Advance();
            }
            else if (";,(){}".Contains(c)) // Punctuation symbols
            {
                tokens.Add(new Token(TokenType.Symbol, c.ToString()));
                Advance();
            }
            else
            {
                tokens.Add(new Token(TokenType.Unknown, c.ToString()));
                Advance();
            }
        }

        return tokens;
    }

    private Token ReadIdentifier()
    {
        StringBuilder sb = new StringBuilder();
        while (_position < _input.Length && char.IsLetterOrDigit(CurrentChar))
        {
            sb.Append(CurrentChar);
            Advance();
        }
        return new Token(TokenType.Identifier, sb.ToString());
    }

    private Token ReadNumber()
    {
        StringBuilder sb = new StringBuilder();
        while (_position < _input.Length && char.IsDigit(CurrentChar))
        {
            sb.Append(CurrentChar);
            Advance();
        }
        return new Token(TokenType.Number, sb.ToString());
    }
}
4. Testing the Tokenizer
We can now test our tokenizer with a simple Main method.
using System;

class Program
{
    static void Main()
    {
        string input = "var x = 42 + 8;";
        Tokenizer tokenizer = new Tokenizer(input);
        var tokens = tokenizer.Tokenize();

        foreach (var token in tokens)
        {
            Console.WriteLine(token);
        }
    }
}
Expected Output
Identifier: var
Whitespace:
Identifier: x
Whitespace:
Operator: =
Whitespace:
Number: 42
Whitespace:
Operator: +
Whitespace:
Number: 8
Symbol: ;
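In many scenarios the whitespace tokens are just noise for later parsing stages. One possible post-processing step (a usage sketch continuing from the Main method above, not part of the tokenizer itself) is to filter them out with LINQ:
using System.Linq;

// Hypothetical refinement: drop whitespace tokens before handing the list to a parser.
var significantTokens = tokens.Where(t => t.Type != TokenType.Whitespace).ToList();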
5. Optimizing the Tokenizer for Performance
While our tokenizer works well, we can improve its efficiency.
Optimization 1: Using ReadOnlySpan<char> Instead of Substrings
Instead of building up each token with a StringBuilder, we can slice a ReadOnlySpan<char> over the input to process text with fewer allocations.
using System;

public ref struct TokenizerSpan
{
    private ReadOnlySpan<char> _span;
    private int _position;

    public TokenizerSpan(ReadOnlySpan<char> span)
    {
        _span = span;
        _position = 0;
    }

    public Token? NextToken()
    {
        if (_position >= _span.Length) return null;

        char c = _span[_position];

        if (char.IsWhiteSpace(c))
        {
            _position++;
            return new Token(TokenType.Whitespace, c.ToString());
        }

        if (char.IsLetter(c))
        {
            int start = _position;
            while (_position < _span.Length && char.IsLetterOrDigit(_span[_position]))
                _position++;
            return new Token(TokenType.Identifier, _span[start.._position].ToString());
        }

        if (char.IsDigit(c))
        {
            int start = _position;
            while (_position < _span.Length && char.IsDigit(_span[_position]))
                _position++;
            return new Token(TokenType.Number, _span[start.._position].ToString());
        }

        if ("+-*/=".Contains(c))
        {
            _position++;
            return new Token(TokenType.Operator, c.ToString());
        }

        if (";,(){}".Contains(c)) // Punctuation symbols, consistent with the Tokenizer class
        {
            _position++;
            return new Token(TokenType.Symbol, c.ToString());
        }

        _position++;
        return new Token(TokenType.Unknown, c.ToString());
    }
}
This version avoids the intermediate StringBuilder work and per-character copies: each token is read by slicing the span, and the only allocation per token is the final string produced by ToString(), which reduces garbage-collection pressure on large inputs.
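Usage differs from the list-based Tokenizer: you pull tokens one at a time. A minimal sketch, assuming the Token and TokenType definitions from earlier and run inside a Main method or top-level statements:
// Hypothetical usage of TokenizerSpan.
string input = "var x = 42 + 8;";
var tokenizer = new TokenizerSpan(input.AsSpan());

Token? token;
while ((token = tokenizer.NextToken()) != null)
{
    Console.WriteLine(token); // e.g. "Identifier: var"
}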
6. Expanding the Tokenizer: Handling Strings and Comments
To support more advanced parsing, we can extend our tokenizer to recognize:
- String literals: "Hello World"
- Comments: // This is a comment
Extending the Tokenizer
First, add String and Comment members to the TokenType enum so these tokens are not mislabeled as identifiers, then add the following methods to the Tokenizer class:
private Token ReadString()
{
    StringBuilder sb = new StringBuilder();
    Advance(); // Skip opening quote
    while (_position < _input.Length && CurrentChar != '"')
    {
        sb.Append(CurrentChar);
        Advance();
    }
    Advance(); // Skip closing quote
    return new Token(TokenType.String, sb.ToString());
}

private Token ReadComment()
{
    StringBuilder sb = new StringBuilder();
    while (_position < _input.Length && CurrentChar != '\n')
    {
        sb.Append(CurrentChar);
        Advance();
    }
    return new Token(TokenType.Comment, sb.ToString());
}
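These methods only take effect once the main loop dispatches to them. A minimal sketch of the extra branches, assuming the extended TokenType from above:
// Inside the while loop of Tokenize(), before the operator/symbol checks:
else if (c == '"')
{
    tokens.Add(ReadString());
}
else if (c == '/' && _position + 1 < _input.Length && _input[_position + 1] == '/')
{
    tokens.Add(ReadComment());
}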
7. Conclusion
In this tutorial, we explored how to build an efficient tokenizer in C#. We started with a basic implementation and optimized it using ReadOnlySpan<char>. We also extended it to support string literals and comments.
By following this approach, you can build highly efficient text parsers for different use cases like programming language interpreters, data processing pipelines, or log file analyzers.
What’s next? Try enhancing this tokenizer by adding support for floating-point numbers, multi-character operators (like >= or !=), and function call parsing!
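As a starting point for the first two exercises, here is one possible shape; it is only a sketch, not the only design. The first snippet would replace ReadNumber in the Tokenizer class, the second would be an extra branch in the Tokenize loop, placed before the single-character operator check.
// Sketch: extend ReadNumber to accept a single decimal point (floating-point literals).
private Token ReadNumber()
{
    StringBuilder sb = new StringBuilder();
    bool seenDot = false;
    while (_position < _input.Length &&
           (char.IsDigit(CurrentChar) || (CurrentChar == '.' && !seenDot)))
    {
        if (CurrentChar == '.') seenDot = true;
        sb.Append(CurrentChar);
        Advance();
    }
    return new Token(TokenType.Number, sb.ToString());
}

// Sketch: recognize two-character operators by peeking one character ahead.
// Place this branch before the single-character operator branch in Tokenize().
else if ((c == '>' || c == '<' || c == '!' || c == '=') &&
         _position + 1 < _input.Length && _input[_position + 1] == '=')
{
    tokens.Add(new Token(TokenType.Operator, _input.Substring(_position, 2)));
    Advance();
    Advance();
}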