Regex Precompilation in Practice

Following up on the previous post on this subject, I want to explain in detail how to put regex precompilation to use in a project.  We implemented this in Community Server for the 2008 release, and we saw a great improvement in our startup times.  We had well over 100 regular expressions throughout our codebase.  Not all of them were static or set to compile where appropriate, so there was a little work to be done in those areas.

Our process consists of an XML file that describes the regular expressions, namespaces and output locations, and a standalone executable that generates the assembly and a wrapper class to call.  Remember that in the previous article I said I'd want a thread-safe way to get at the instance? That is what the wrapper is for.  I'll show the basics of what we are doing and provide a sample framework so you can do the same thing in your projects.

A sample of the XML definition file:

<?xml version="1.0" encoding="utf-8" ?>
<RegEx namespace="RegexLibrary">
    <Configuration>
        <ASSEMBLY_OUTNAME>RegexLibrary</ASSEMBLY_OUTNAME>
        <CS_OUTNAME>YourNamespace.Components</CS_OUTNAME>
        <CS_OUTNAME_FILE>YNSRegex</CS_OUTNAME_FILE>
        <ASSEMBLY_FOLDERPATH>.\\outfile\\</ASSEMBLY_FOLDERPATH>
        <CS_FOLDERPATH>.\\outfile\\</CS_FOLDERPATH>
    </Configuration>

    <Item id="Spacer" description="DESCRIPTION" options="NONE">
        <![CDATA[\s{2,}]]>
    </Item>
</RegEx>

ASSEMBLY_OUTNAME is the name of the generated assembly.  In this case, RegexLibrary.dll.
CS_OUTNAME is the namespace that will contain your regular expression wrapper.
CS_OUTNAME_FILE is the file name of your wrapper file.  This is also the class name for your regular expressions.
ASSEMBLY_FOLDERPATH is the destination folder of the assembly.
CS_FOLDERPATH is the destination folder of the wrapper file.
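
To make the file format concrete, here is a minimal sketch of how a definition file like the one above could be read.  This is not the code from our tool: RegexDefinition and DefinitionReader are hypothetical names, and I am assuming that the options attribute holds RegexOptions member names.  It uses System.Xml.Linq, which needs .NET 3.5 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical holder for one <Item>; the real tool uses its own extended class.
public sealed class RegexDefinition
{
    public string Id;
    public string Description;
    public RegexOptions Options;
    public string Pattern;
}

public static class DefinitionReader
{
    // Reads one value from the <Configuration> block, e.g. "ASSEMBLY_OUTNAME".
    public static string ReadSetting(XDocument doc, string name)
    {
        XElement config = doc.Root.Element("Configuration");
        XElement setting = config == null ? null : config.Element(name);
        if (setting == null)
            throw new FormatException("Missing configuration value: " + name);
        return setting.Value.Trim();
    }

    // Turns every <Item> element into a RegexDefinition.
    public static List<RegexDefinition> ReadItems(XDocument doc)
    {
        var items = new List<RegexDefinition>();
        foreach (XElement item in doc.Root.Elements("Item"))
        {
            // Take the CDATA node itself so the indentation around it
            // never leaks into the pattern.
            XCData cdata = item.Nodes().OfType<XCData>().FirstOrDefault();
            string pattern = cdata != null ? cdata.Value : item.Value.Trim();

            items.Add(new RegexDefinition
            {
                Id = (string)item.Attribute("id"),
                Description = (string)item.Attribute("description"),
                // Assumption: the attribute holds RegexOptions member names
                // ("NONE", "Compiled, IgnoreCase"), matched case-insensitively.
                Options = (RegexOptions)Enum.Parse(
                    typeof(RegexOptions),
                    (string)item.Attribute("options") ?? "None",
                    true),
                Pattern = pattern
            });
        }
        return items;
    }
}
```

Load the file with XDocument.Load, call DefinitionReader.ReadItems on it, and you have one entry per regular expression to hand to whatever builds the assembly.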

The wrapper file has a structure similar to the one I provided in the previous post on this topic:

using System;
using System.Text.RegularExpressions;
using RL = YourCodebase.RegexLibrary;

namespace YourCodebase.Components
{
    public class CSRegex
    {
        private static readonly object locker = new object();

        static CSRegex() {}
        private CSRegex() {}

        private static Regex __Spacer;

        public static Regex SpacerRegex()
        {
            lock(locker)
            {
                if( __Spacer == null )
                {
                    __Spacer = new RL.Spacer();
                }
                return __Spacer;
            }
        }
    }
}
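
As an aside, the lock-and-check pattern in the wrapper can also be expressed with Lazy<T>.  Keep in mind that Lazy<T> only arrived in .NET Framework 4, so it was not an option for the release described here; this is a sketch of an alternative, not what our generator emits.  The LazyRegexes class is a hypothetical name, and a plain compiled Regex stands in for the generated RL.Spacer type so that the sample compiles on its own.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyRegexes
{
    // Lazy<T> defaults to LazyThreadSafetyMode.ExecutionAndPublication, so the
    // factory runs at most once even when several threads race on first use.
    private static readonly Lazy<Regex> spacer = new Lazy<Regex>(
        // Stand-in for "new RL.Spacer()" so the sketch has no external reference.
        () => new Regex(@"\s{2,}", RegexOptions.Compiled));

    public static Regex SpacerRegex()
    {
        return spacer.Value;
    }

    // Example use: collapse each run of two or more whitespace characters
    // into a single space.
    public static string CollapseSpaces(string input)
    {
        return SpacerRegex().Replace(input, " ");
    }
}
```

The difference from the wrapper above is that callers no longer take a lock on every access; once the value exists, reading it is just a property read.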

Now for the method that does the real work.  Note that we are using an extended class for the regular expression data.  This is to provide additional information for the wrapper file.

static void Build()
{
    string assemblyInitialPath = Path.Combine(Environment.CurrentDirectory, ASSEMBLY_OUTNAME + ".dll");
    string assemblyFullPath = Path.Combine(Path.Combine(Environment.CurrentDirectory, ASSEMBLY_FOLDERPATH), ASSEMBLY_OUTNAME + ".dll");
    string outFileFullPath = Path.Combine(Path.Combine(Environment.CurrentDirectory, CS_FOLDERPATH), CS_OUTNAME_FILE + ".cs");

    assemblyFullPath = Path.GetFullPath(assemblyFullPath);
    outFileFullPath = Path.GetFullPath(outFileFullPath);

    List<RegexCompilationInfoExt> lrci = GetRegexInfo(ASSEMBLY_OUTNAME);
    Console.WriteLine( "Total of {0} regular expressions will be created.", lrci.Count );

    if ( File.Exists( assemblyFullPath ) )
        File.Delete( assemblyFullPath );

    // build the assembly
    Regex.CompileToAssembly(lrci.ToArray(), new AssemblyName(ASSEMBLY_OUTNAME));

    // copy assembly
    File.Copy(assemblyInitialPath, assemblyFullPath, true);

    // build an outfile
    OutFile(outFileFullPath, lrci);
}
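
The last step, writing the wrapper file, is mostly string concatenation.  The sketch below shows one way it could look; it is not our actual OutFile method.  WrapperWriter and its parameters are hypothetical, and it takes plain regex names rather than the extended compilation info class used above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class WrapperWriter
{
    // Builds the wrapper source: one cached field and one accessor per name.
    public static string Generate(string ns, string className,
                                  string libraryNamespace, IEnumerable<string> regexNames)
    {
        var sb = new StringBuilder();
        sb.AppendLine("using System;");
        sb.AppendLine("using System.Text.RegularExpressions;");
        sb.AppendLine("using RL = " + libraryNamespace + ";");
        sb.AppendLine();
        sb.AppendLine("namespace " + ns);
        sb.AppendLine("{");
        sb.AppendLine("    public class " + className);
        sb.AppendLine("    {");
        sb.AppendLine("        private static readonly object locker = new object();");
        sb.AppendLine();
        sb.AppendLine("        private " + className + "() {}");
        foreach (string name in regexNames)
        {
            sb.AppendLine();
            sb.AppendLine("        private static Regex __" + name + ";");
            sb.AppendLine("        public static Regex " + name + "Regex()");
            sb.AppendLine("        {");
            sb.AppendLine("            lock (locker)");
            sb.AppendLine("            {");
            sb.AppendLine("                if (__" + name + " == null)");
            sb.AppendLine("                {");
            sb.AppendLine("                    __" + name + " = new RL." + name + "();");
            sb.AppendLine("                }");
            sb.AppendLine("                return __" + name + ";");
            sb.AppendLine("            }");
            sb.AppendLine("        }");
        }
        sb.AppendLine("    }");
        sb.AppendLine("}");
        return sb.ToString();
    }

    // Writes the generated source, creating the output folder if it is missing.
    public static void Write(string path, string source)
    {
        string folder = Path.GetDirectoryName(Path.GetFullPath(path));
        Directory.CreateDirectory(folder);
        File.WriteAllText(path, source);
    }
}
```

Calling Generate with the namespace, class name and library namespace from the wrapper shown earlier, plus the single name Spacer, produces a class with the same SpacerRegex() accessor.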

For our environment, each of these options was necessary.  I don't expect that everyone will want or need all of this, but the gist of this exercise is to be careful about what you initialize in your static classes. Any of them.  If something should be lazy loaded, then make it happen.  Employ the Singleton pattern when appropriate.  Note that the precompilation of regular expressions has to run before your main build, because the main project references the generated assembly.

If your project has only a few regular expressions, by all means just use RegexOptions.Compiled and skip the precompilation (but at least use lazy loading).  But consider that as your project grows in complexity, your needs will grow with it.  Keep that in mind when designing your next large-scale project.

[ Download a sample project for building these files ]
