Spell Checking Your Code Using Roslyn

Tags: Code Review, roslyn

Many coding standards require code to be written in English, using English names. Even though the vast majority of developers speak English, those who are not native speakers are bound to add some odd names to classes, methods, properties, etc. With the help of Roslyn and a dictionary, it is possible to scan through your code and find potentially misspelled names.

I use the NHunspell library to perform the actual spell checking. It handles all the interaction with dictionary files through a simple API. All you need to provide is a dictionary to look names up in. OpenOffice gives dictionaries away for free, so pick up the language you want to check for. I just use English, because it is the standard. The dictionaries come as .oxt files, which are just zip files containing the files you need: an .aff affix file and a .dic word list. So you can simply extract those two files from your .oxt dictionary using a zip library like DotNetZip. The following code shows how to set up your spell checker:

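// Requires the DotNetZip (Ionic.Zip) and NHunspell packages, plus System.IO and System.Linq.
using (var dictFile = ZipFile.Read(@"Dictionaries\dict-en.oxt"))
using (var affStream = new MemoryStream())
using (var dicStream = new MemoryStream())
{
    // An .oxt file is a zip archive; pull out the affix rules and the word list.
    dictFile.First(z => z.FileName == "en_US.aff").Extract(affStream);
    dictFile.First(z => z.FileName == "en_US.dic").Extract(dicStream);
    // ToArray copies the bytes, so the streams can be disposed afterwards.
    return new Hunspell(affStream.ToArray(), dicStream.ToArray());
}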

Now that you have your Hunspell checker set up, you can simply throw all the words you want at it and see if they check out.

// CreateSpellChecker is a hypothetical method wrapping the setup code above.
Hunspell checker = CreateSpellChecker();
bool isSpelledCorrectly = checker.Spell("word");

So, spell checking is simple enough. Now you just need to trawl through your code and check the name declarations. To do this, you can make use of the SyntaxWalker class from Roslyn (in the released Roslyn packages, the C# walker is named CSharpSyntaxWalker). The walker visits every node in a syntax tree, so you can check all the declarations that you want to be spelled correctly. Typically this would be classes, methods and properties, so you would need to override the appropriate visit methods. For example:

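// Note: in the released Roslyn packages the C# walker base class is named CSharpSyntaxWalker.
class SpellingVisitor : SyntaxWalker
{
	// Matches every capital letter, so a Pascal-cased name can be split into words.
	private static readonly Regex CapitalRegex = new Regex("[A-Z]", RegexOptions.Compiled);
	// Words you accept even though the dictionary does not know them.
	private static readonly string[] _knownWordList = new string[0];
	private readonly NHunspell.Hunspell _checker;

	public SpellingVisitor(NHunspell.Hunspell checker)
	{
		_checker = checker;
	}

	public override void VisitMethodDeclaration(MethodDeclarationSyntax node)
	{
		var name = node.Identifier.ValueText;
		if (!IsSpelledCorrectly(name))
		{
			// Do something with method name
		}
		base.VisitMethodDeclaration(node);
	}

	private bool IsSpelledCorrectly(string name)
	{
		// Put a space before each capital, split on spaces, and check every part
		// that is not already in the known word list.
		return CapitalRegex.Replace(name, m => " " + m.Value)
			.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
			.Where(s => !_knownWordList.Contains(s))
			.All(s => _checker.Spell(s));
	}
}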

You will notice a couple of things in the code above. Apart from the actual spell checker, which I showed how to create above, the class also has a list of known words. Here it's empty, but I would expect you to have some words that you use but that are not in the dictionary. The other thing is the regex. As method names are assumed to be Pascal cased, you need to split them to check the individual parts. The regex's Replace method puts a space in front of every capital letter, so the name can then be split on spaces.

As you can see in IsSpelledCorrectly, each part of the method name is checked, and if they all pass, the name is approved. If not, you can do something with the name.
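
To make the splitting step concrete, here is a minimal standalone sketch of just that part, without Roslyn or NHunspell. The class and method names (IdentifierWords, Split) are made up for illustration; the regex is the same one used in the visitor above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Roslyn or NHunspell.
public static class IdentifierWords
{
    // Matches every capital letter in the name.
    private static readonly Regex CapitalRegex = new Regex("[A-Z]", RegexOptions.Compiled);

    // Inserts a space before each capital letter, then splits on spaces.
    public static string[] Split(string name)
    {
        return CapitalRegex.Replace(name, m => " " + m.Value)
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

With this sketch, IdentifierWords.Split("GetUserName") gives the parts Get, User and Name. Note that an acronym such as XMLReader ends up as X, M, L and Reader, because every capital starts a new part, so you may want to put such names in the known word list or skip single-letter parts.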

The example above overrides the VisitMethodDeclaration method, but you can override the visit methods for other declarations, such as classes and properties, in the same way.
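
As a rough sketch of what that could look like for classes and properties, assuming the released Microsoft.CodeAnalysis.CSharp package (where the walker base class is CSharpSyntaxWalker rather than the SyntaxWalker used above): the class name DeclarationSpellingWalker and the Misspelled list are my own inventions for illustration, and the spelling check is passed in as a delegate so the sketch does not depend on a dictionary file.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.CodeAnalysis.CSharp;
using Microsoft.CodeAnalysis.CSharp.Syntax;

// Hypothetical walker that collects class and property names failing a supplied check.
public class DeclarationSpellingWalker : CSharpSyntaxWalker
{
    private readonly Func<string, bool> _isSpelledCorrectly;

    // Names that did not pass the check, in the order they were visited.
    public List<string> Misspelled { get; } = new List<string>();

    public DeclarationSpellingWalker(Func<string, bool> isSpelledCorrectly)
    {
        _isSpelledCorrectly = isSpelledCorrectly;
    }

    public override void VisitClassDeclaration(ClassDeclarationSyntax node)
    {
        Check(node.Identifier.ValueText);
        // Keep walking so members and nested classes are visited too.
        base.VisitClassDeclaration(node);
    }

    public override void VisitPropertyDeclaration(PropertyDeclarationSyntax node)
    {
        Check(node.Identifier.ValueText);
        base.VisitPropertyDeclaration(node);
    }

    private void Check(string name)
    {
        if (!_isSpelledCorrectly(name))
        {
            Misspelled.Add(name);
        }
    }
}
```

You would parse a file with CSharpSyntaxTree.ParseText(source).GetRoot(), call Visit on the root, and then look at Misspelled. In real use the delegate would be something like the IsSpelledCorrectly method shown earlier.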

Checking comments as well

Most of the attention given to Roslyn is focused on semantic analysis and syntax trees. But Roslyn gives you access to much more than that, such as SyntaxTrivia. Trivia are all the things around your code, like whitespace and line breaks. When people say that whitespace is not significant in C#, that is not the whole truth. If you drill through your syntax trees, you will find a significant amount of trivia.

Comments are a kind of trivia, which you can also spell check. This will either reveal commented out code (which we all know is a lame replacement for version control, but still use), or some incoherent words. Checking your comments is a good way of making sure that they are actually useful.
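
Here is a minimal sketch of how you might pull the comments out of a file, again assuming the released Microsoft.CodeAnalysis.CSharp package. CommentExtractor and GetCommentTexts are hypothetical names; the Roslyn calls used are DescendantTrivia and the SingleLineCommentTrivia and MultiLineCommentTrivia syntax kinds.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

// Hypothetical helper that returns the raw text of ordinary comments in a source file.
public static class CommentExtractor
{
    public static List<string> GetCommentTexts(string source)
    {
        SyntaxNode root = CSharpSyntaxTree.ParseText(source).GetRoot();

        // Walk all trivia in the tree and keep only // and /* */ comments.
        return root.DescendantTrivia()
            .Where(t => t.IsKind(SyntaxKind.SingleLineCommentTrivia)
                     || t.IsKind(SyntaxKind.MultiLineCommentTrivia))
            .Select(t => t.ToString())
            .ToList();
    }
}
```

Each returned string still includes the comment markers, so you would strip those and split the rest into words before handing them to the spell checker. XML documentation comments (the /// kind) are a different kind of trivia and are not picked up by this sketch.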
