This is a follow-up post to two earlier posts on module organization and cyclic dependencies.
I thought it would be interesting to look at some real projects written in C# and F#, and see how they compare in modularity and number of cyclic dependencies.
My plan was to take ten or so projects written in C# and ten or so projects written in F#, and somehow compare them.
I didn't want to spend too much time on this, and so rather than trying to analyze the source files, I thought I would cheat a little and analyze the compiled assemblies, using the Mono.Cecil library.
This also meant that I could get the binaries directly, using NuGet.
The projects I picked were:
C# projects
F# projects
Unfortunately, there is not yet a wide variety of F# projects to choose from. I picked the following:
I did choose SpecFlow and TickSpec as being directly comparable, and also YamlDotNet and FsYaml.
But as you can see, most of the F# projects are not directly comparable to the C# ones. For example, there is no direct F# equivalent to Nancy, or Entity Framework.
Nevertheless, I was hoping that I might observe some sort of pattern by comparing the projects. And I was right. Read on for the results!
I wanted to examine two things: "modularity" and "cyclic dependencies".
First, what should be the unit of "modularity"?
From a coding point of view, we generally work with files (Smalltalk being a notable exception), and so it makes sense to think of the file as the unit of modularity. A file is used to group related items together, and if two chunks of code are in different files, they are somehow not as "related" as if they were in the same file.
In C#, the best practice is to have one class per file. So 20 files means 20 classes. Sometimes classes have nested classes, but with rare exceptions, the nested class is in the same file as the parent class. This means that we can ignore them and just use top-level classes as our unit of modularity, as a proxy for files.
In F#, the best practice is to have one module per file (or sometimes more). So 20 files means 20 modules. Behind the scenes, modules are turned into static classes, and any classes defined within the module are turned into nested classes. So again, this means that we can ignore nested classes and just use top-level classes as our unit of modularity.
The C# and F# compilers generate many "hidden" types, for things such as LINQ, lambdas, etc. In some cases, I wanted to exclude these, and only include "authored" types, which have been coded for explicitly. The compiler-generated classes generally contain a special character such as < or $, so they are easy to detect. I didn't try to exclude the F# sum types from authored classes though.
So my definition of a top-level type is: a type that is not nested and which is not compiler generated.
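To make the definition concrete, here is a minimal sketch of the filter in Python (not the F# script that was actually used). The `TypeInfo` record and the example type names are hypothetical stand-ins for whatever an assembly reader returns; the only rule taken from the text is that names containing < or $ are treated as compiler generated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeInfo:
    """Hypothetical stand-in for a type read from a compiled assembly."""
    full_name: str
    is_nested: bool


# Characters that mark compiler-generated type names, as described above.
GENERATED_MARKERS = ("<", "$")


def is_compiler_generated(t: TypeInfo) -> bool:
    return any(marker in t.full_name for marker in GENERATED_MARKERS)


def is_authored(t: TypeInfo) -> bool:
    # Authored types may be nested, but must not be compiler generated.
    return not is_compiler_generated(t)


def is_top_level(t: TypeInfo) -> bool:
    # Top-level types are authored types that are not nested.
    return not t.is_nested and not is_compiler_generated(t)


# Illustrative names only; they are not taken from any of the analyzed projects.
types = [
    TypeInfo("MyApp.Parser", False),
    TypeInfo("MyApp.Parser/Token", True),
    TypeInfo("MyApp.Parser/<>c__DisplayClass1", True),
    TypeInfo("<StartupCode$MyApp>.$Parser", False),
]

top_level = [t.full_name for t in types if is_top_level(t)]
authored = [t.full_name for t in types if is_authored(t)]
```

With this sample, only `MyApp.Parser` counts as top-level, while `MyApp.Parser` and `MyApp.Parser/Token` both count as authored.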
The metrics I chose for modularity were:
Once we have our units of modularity, we can look at dependencies between modules.
For this analysis, I only want to include dependencies between types in the same assembly. In other words, dependencies on system types such as String or List do not count as a dependency.
Let's say we have a top-level type A and another top-level type B. Then I say that a dependency exists from A to B if:
- A or any of its nested types inherits from (or implements) type B or any of its nested types.
- A or any of its nested types has a field, property or method that references type B or any of its nested types as a parameter or return value. This includes private members as well -- after all, it is still a dependency.
- A or any of its nested types has a method implementation that references type B or any of its nested types.

This might not be a perfect definition. But it is good enough for my purposes.
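The rules above boil down to "collect every type reference, then roll nested types up to their top-level parent". Here is a rough Python sketch of that roll-up, assuming the raw references have already been extracted somehow. The function name, the `parent_of` mapping and the sample data are all hypothetical, not part of the original script.

```python
def top_level_dependencies(references, parent_of, assembly_types):
    """Collapse raw type references into dependencies between top-level types.

    references:     iterable of (from_type, to_type) name pairs, at any nesting level
    parent_of:      dict mapping each nested type name to its top-level parent
    assembly_types: set of every type name defined in the assembly
    """
    deps = set()
    for src, dst in references:
        if dst not in assembly_types:
            continue  # external types such as String do not count
        a = parent_of.get(src, src)  # roll nested types up to the top level
        b = parent_of.get(dst, dst)
        if a != b:  # references within the same top-level type are ignored
            deps.add((a, b))
    return deps


# Illustrative data only.
parent_of = {"A/Inner": "A", "B/Helper": "B"}
assembly_types = {"A", "A/Inner", "B", "B/Helper"}
references = [
    ("A/Inner", "B/Helper"),  # counts as A -> B
    ("A", "System.String"),   # external, ignored
    ("A", "A/Inner"),         # same top-level type, ignored
    ("B", "B/Helper"),        # same top-level type, ignored
]

deps = top_level_dependencies(references, parent_of, assembly_types)
```

Using a set means that many references between the same pair of top-level types still count as a single dependency.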
In addition to all dependencies, I thought it might be useful to look at "public" or "published" dependencies. A public dependency from A to B exists if:
- A or any of its nested types inherits from (or implements) type B or any of its nested types.
- A or any of its nested types has a public field, property or method that references type B or any of its nested types as a parameter or return value.

The metrics I chose for dependencies were:
Given this definition of dependency, then, a cyclic dependency occurs when two different top-level types depend on each other.
Note what is not included in this definition. If a nested type in a module depends on another nested type in the same module, then that is not a cyclic dependency.
If there is a cyclic dependency, then there is a set of modules that are all linked together. For example, if A depends on B, B depends on C, and then say, C depends on A, then A, B and C are linked together. In graph theory, this is called a strongly connected component.
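For readers who have not met strongly connected components before, here is a small hand-rolled version of Tarjan's algorithm in Python, applied to the A, B, C example. This is only an illustration: the actual analysis used the QuickGraph library, and treating "a cycle" as "a component with more than one member" is my reading of the metrics rather than something stated explicitly.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; graph maps each node to the nodes it depends on."""
    index = {}       # discovery order of each node
    lowlink = {}     # smallest index reachable from each node
    stack = []
    on_stack = set()
    components = []
    counter = 0

    def visit(node):
        nonlocal counter
        index[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for succ in graph.get(node, ()):
            if succ not in index:
                visit(succ)
                lowlink[node] = min(lowlink[node], lowlink[succ])
            elif succ in on_stack:
                lowlink[node] = min(lowlink[node], index[succ])
        if lowlink[node] == index[node]:
            # node is the root of a component: pop its members off the stack
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            components.append(component)

    for node in list(graph):
        if node not in index:
            visit(node)
    return components


# A -> B -> C -> A form a cycle; D depends on A but nothing depends on D.
graph = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A"]}
components = strongly_connected_components(graph)

# Assumption: a "cycle" is a component containing more than one type.
cycles = [c for c in components if len(c) > 1]
max_component_size = max(len(c) for c in components)
```

Here A, B and C end up in one component of size 3, and D sits in a component by itself. The recursive form is fine for small graphs, but a very large graph could hit Python's recursion limit.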
The metrics I chose for cyclic dependencies were:
I analyzed cyclic dependencies for all dependencies and also for public dependencies only.
First, I downloaded each of the project binaries using NuGet. Then I wrote a little F# script that did the following steps for each assembly:
This dependency list was then used to extract various statistics, shown below. I also rendered the dependency graphs to SVG format (using Graphviz).
For cycle detection, I used the QuickGraph library to extract the strongly connected components, and then did some more processing and rendering.
If you want the gory details, here is a link to the script that I used, and here is the raw data.
It is important to recognize that this is not a proper statistical study, just a quick analysis. However the results are quite interesting, as we shall see.
Let's look at the modularity first.
Here are the modularity-related results for the C# projects:
Project | Code size | Top-level types | Authored types | All types | Code/Top | Code/Auth | Code/All | Auth/Top | All/Top |
---|---|---|---|---|---|---|---|---|---|
ef | 269521 | 517 | 568 | 879 | 521 | 475 | 307 | 1.10 | 1.70 |
jsonDotNet | 148829 | 215 | 232 | 283 | 692 | 642 | 526 | 1.08 | 1.32 |
nancy | 143445 | 339 | 366 | 560 | 423 | 392 | 256 | 1.08 | 1.65 |
cecil | 101121 | 240 | 245 | 247 | 421 | 413 | 409 | 1.02 | 1.03 |
nuget | 114856 | 218 | 239 | 383 | 527 | 481 | 300 | 1.10 | 1.76 |
signalR | 65513 | 193 | 230 | 312 | 339 | 285 | 210 | 1.19 | 1.62 |
nunit | 45023 | 173 | 195 | 197 | 260 | 231 | 229 | 1.13 | 1.14 |
specFlow | 46065 | 242 | 287 | 331 | 190 | 161 | 139 | 1.19 | 1.37 |
elmah | 43855 | 116 | 140 | 141 | 378 | 313 | 311 | 1.21 | 1.22 |
yamlDotNet | 23499 | 70 | 73 | 73 | 336 | 322 | 322 | 1.04 | 1.04 |
TOTAL | 1001727 | 2323 | 2575 | 3406 | 431 | 389 | 294 | 1.11 | 1.47 |
And here are the results for the F# projects:
Project | Code size | Top-level types | Authored types | All types | Code/Top | Code/Auth | Code/All | Auth/Top | All/Top |
---|---|---|---|---|---|---|---|---|---|
fsxCore | 339596 | 173 | 407 | 2024 | 1963 | 834 | 168 | 2.35 | 11.70 |
fsCore | 226830 | 154 | 348 | 1186 | 1473 | 652 | 191 | 2.26 | 7.70 |
fsPowerPack | 117581 | 93 | 162 | 410 | 1264 | 726 | 287 | 1.74 | 4.41 |
storm | 73595 | 67 | 78 | 405 | 1098 | 944 | 182 | 1.16 | 6.04 |
fsParsec | 67252 | 8 | 27 | 245 | 8407 | 2491 | 274 | 3.38 | 30.63 |
websharper | 47391 | 52 | 129 | 285 | 911 | 367 | 166 | 2.48 | 5.48 |
tickSpec | 30797 | 34 | 53 | 170 | 906 | 581 | 181 | 1.56 | 5.00 |
websharperHtml | 14787 | 18 | 28 | 72 | 822 | 528 | 205 | 1.56 | 4.00 |
canopy | 15105 | 6 | 17 | 103 | 2518 | 889 | 147 | 2.83 | 17.17 |
fsYaml | 15191 | 7 | 14 | 160 | 2170 | 1085 | 95 | 2.00 | 22.86 |
fsSql | 15434 | 13 | 22 | 162 | 1187 | 702 | 95 | 1.69 | 12.46 |
fsUnit | 1848 | 2 | 3 | 7 | 924 | 616 | 264 | 1.50 | 3.50 |
TOTAL | 965407 | 627 | 1288 | 5229 | 1540 | 750 | 185 | 2.05 | 8.34 |
The columns are:
I have extended these core metrics with some extra calculated columns:
The first thing I noticed is that, with a few exceptions, the code size is much bigger for the C# projects than for the F# projects. Partly that is because I picked bigger projects, of course. But even for a somewhat comparable project like SpecFlow vs. TickSpec, the SpecFlow code size is bigger. It may well be that SpecFlow does a lot more than TickSpec, of course, but it also may be a result of using more generic code in F#. There is not enough information to know either way right now -- it would be interesting to do a true side by side comparison.
Next, the number of top-level types. I said earlier that this should correspond to the number of files in a project. Does it?
I didn't get all the sources for all the projects to do a thorough check, but I did a couple of spot checks. For example, for Nancy, there are 339 top level classes, which implies that there should be about 339 files. In fact, there are actually 322 .cs files, so not a bad estimate.
On the other hand, for SpecFlow there are 242 top level types, but only 171 .cs files, so a bit of an overestimate there. And for Cecil, the same thing: 240 top level classes but only 128 .cs files.
For the FSharpX project, there are 173 top level classes, which implies there should be about 173 files. In fact, there are actually only 78 .fs files, so it is a serious over-estimate by a factor of more than 2. And if we look at Storm, there are 67 top level classes. In fact, there are actually only 35 .fs files, so again it is an over-estimate by a factor of 2.
So it looks like the number of top level classes is always an over-estimate of the number of files, but much more so for F# than for C#. It would be worth doing some more detailed analysis in this area.
The "Code/Top" ratio is consistently bigger for F# code than for C# code. Overall, the average top-level type in C# is converted into 431 instructions. But for F# that number is 1540 instructions, over three times as many.
I expect that this is because F# code is more concise than C# code. I would guess that 500 lines of F# code in a single module would create many more CIL instructions than 500 lines of C# code in a class.
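The calculated columns in the tables above appear to be simple ratios of the raw counts. As a sanity check, here is a short Python sketch that recomputes them for the Entity Framework row and the two TOTAL rows. The function name and the rounding choices are my assumptions, inferred from the column names and the printed values.

```python
def derived_columns(code_size, top_level, authored, all_types):
    """Recompute the ratio columns from the four raw counts (assumed definitions)."""
    return {
        "Code/Top": round(code_size / top_level),
        "Code/Auth": round(code_size / authored),
        "Code/All": round(code_size / all_types),
        "Auth/Top": round(authored / top_level, 2),
        "All/Top": round(all_types / top_level, 2),
    }


# Raw counts copied from the tables above.
ef = derived_columns(269521, 517, 568, 879)
csharp_total = derived_columns(1001727, 2323, 2575, 3406)
fsharp_total = derived_columns(965407, 627, 1288, 5229)

# How many times larger the average F# top-level type is than the C# one.
code_per_top_ratio = fsharp_total["Code/Top"] / csharp_total["Code/Top"]
```

The recomputed values match the table (for example 431 and 1540 instructions per top-level type), and the ratio between them comes out at roughly 3.6.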
If we visually plot "Code size" vs. "Top-level types", we get this chart:
What's surprising to me is how linear this chart is. The C# projects seem to have a consistent ratio of about 2.3 top-level types per 1000 instructions, even across different project sizes. And the F# projects are consistent too, having a ratio of about 0.6 top-level types per 1000 instructions.
The message I get from all this is that, for a given size of project, the F# version will have fewer modules, and presumably less complexity as a result.
On the other hand, if we compare the ratio of code to all types, including compiler generated ones, we get a very different result.
Here's the corresponding chart of "Code size" vs. "All types":
Again, this is surprisingly linear. The total number of types (including compiler generated ones) seems to depend closely on the size of the project.
The "size" of a type is somewhat smaller for F# code than for C# code. The average type in C# is converted into 294 instructions. But for F# that number is 185 instructions.
I'm not sure why this is. Is it because the F# types are more fine-grained, or could it be because the F# compiler generates many more little types than the C# compiler? Without doing a more subtle analysis, I can't tell.
Having compared the type counts to the code size, let's now compare them to each other:
This really brings out the difference. For each unit of modularity in C# there are an average of 1.11 authored types. But in F# the average is 2.05, and for some projects a lot more than that.
To me, this implies that the F# types are more fine-grained than the C# types.
Of course, creating nested types is trivial in F#, and quite uncommon in C#, so you could argue that this is not a fair comparison. But surely the ability to create a dozen types in as many lines of F# has some effect on the quality of the design? This is harder to do in C#, but there is nothing to stop you. So might this not mean that there is a temptation in C# to not be as fine-grained as you could potentially be?
Now let's look at the dependency relationships between the top level classes.
Here are the results for the C# projects:
Project | Top Level Types | Total Dep. Count | Dep/Top | One or more dep. | Three or more dep. | Five or more dep. | Ten or more dep. | Diagram |
---|---|---|---|---|---|---|---|---|
ef | 517 | 2462 | 4.8 | 78% | 52% | 32% | 14% | svg; dotfile |
jsonDotNet | 215 | 913 | 4.2 | 69% | 42% | 30% | 14% | svg; dotfile |
nancy | 339 | 1132 | 3.3 | 78% | 41% | 22% | 6% | svg; dotfile |
cecil | 240 | 1145 | 4.8 | 73% | 43% | 23% | 13% | svg; dotfile |
nuget | 218 | 875 | 4.0 | 72% | 43% | 28% | 13% | svg; dotfile |
signalR | 193 | 664 | 3.4 | 67% | 34% | 20% | 10% | svg; dotfile |
nunit | 173 | 499 | 2.9 | 75% | 39% | 13% | 4% | svg; dotfile |
specFlow | 242 | 578 | 2.4 | 64% | 25% | 17% | 5% | svg; dotfile |
elmah | 116 | 300 | 2.6 | 72% | 28% | 22% | 6% | svg; dotfile |
yamlDotNet | 70 | 228 | 3.3 | 83% | 30% | 11% | 4% | svg; dotfile |
TOTAL | 2323 | 8796 | 3.8 | 73% | 40% | 24% | 10% | |
And here are the results for the F# projects:
Project | Top Level Types | Total Dep. Count | Dep/Top | One or more dep. | Three or more dep. | Five or more dep. | Ten or more dep. | Diagram |
---|---|---|---|---|---|---|---|---|
fsxCore | 173 | 76 | 0.4 | 30% | 4% | 1% | 0% | svg; dotfile |
fsCore | 154 | 287 | 1.9 | 55% | 26% | 14% | 3% | svg; dotfile |
fsPowerPack | 93 | 68 | 0.7 | 38% | 13% | 2% | 0% | svg; dotfile |
storm | 67 | 195 | 2.9 | 72% | 40% | 18% | 4% | svg; dotfile |
fsParsec | 8 | 9 | 1.1 | 63% | 25% | 0% | 0% | svg; dotfile |
websharper | 52 | 18 | 0.3 | 31% | 0% | 0% | 0% | svg; dotfile |
tickSpec | 34 | 48 | 1.4 | 50% | 15% | 9% | 3% | svg; dotfile |
websharperHtml | 18 | 37 | 2.1 | 78% | 39% | 6% | 0% | svg; dotfile |
canopy | 6 | 8 | 1.3 | 50% | 33% | 0% | 0% | svg; dotfile |
fsYaml | 7 | 10 | 1.4 | 71% | 14% | 0% | 0% | svg; dotfile |
fsSql | 13 | 14 | 1.1 | 54% | 8% | 8% | 0% | svg; dotfile |
fsUnit | 2 | 0 | 0.0 | 0% | 0% | 0% | 0% | svg; dotfile |
TOTAL | 627 | 770 | 1.2 | 46% | 17% | 7% | 1% | |
The columns are:
The diagram column contains a link to an SVG file, generated from the dependencies, and also the DOT file that was used to generate the SVG. See below for a discussion of these diagrams.
These results are very interesting. For C#, the number of total dependencies increases with project size. Each top-level type depends on 3-4 others, on average.
On the other hand, the number of total dependencies in an F# project does not seem to vary too much with project size at all. Excluding Storm, each F# module depends on no more than 1-2 others, on average. And the largest project (FSharpX) has a lower ratio than many of the smaller projects. The Storm project is an exception, presumably because it has user interface code (the menu screens have dependencies on many other types).
Here's a chart of the relationship between code size and the number of dependencies:
The disparity between C# and F# projects is very clear. The C# dependencies seem to grow linearly with project size, while the F# dependencies seem to be flat.
The average number of dependencies per top level type is interesting, but it doesn't help us understand the variability. Are there many modules with lots of dependencies? Or does each one just have a few?
This might make a difference in maintainability, perhaps. I would assume that a module with only one or two dependencies would be easier to understand in the context of the application than one with tens of dependencies.
Rather than doing a sophisticated statistical analysis, I thought I would keep it simple and just count how many top level types had one or more dependencies, three or more dependencies, and so on.
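The counting itself is straightforward. Here is a minimal Python sketch, assuming the dependencies are available as a set of unique (from, to) pairs between top-level types; the function name and the sample data are hypothetical.

```python
from collections import Counter


def dependency_thresholds(top_level_types, dependencies, thresholds=(1, 3, 5, 10)):
    """Fraction of top-level types with at least n outgoing dependencies.

    dependencies must contain each (from, to) pair only once.
    """
    out_degree = Counter(src for src, _ in dependencies)
    total = len(top_level_types)
    if total == 0:
        return {n: 0.0 for n in thresholds}
    return {
        n: sum(1 for t in top_level_types if out_degree[t] >= n) / total
        for n in thresholds
    }


# Illustrative data: A depends on three types, B on one, C and D on none.
types = ["A", "B", "C", "D"]
deps = {("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")}

fractions = dependency_thresholds(types, deps)
```

In this toy example half of the types have one or more dependencies, a quarter have three or more, and none have five or more.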
Here are the same results, displayed visually:
So, what can we deduce from these numbers?
First, in the F# projects, more than half of the modules have no outside dependencies at all. This is a bit surprising. You might think that it is because most of them are library projects, rather than applications, but then, so are most of the C# projects too.
Second, the modules in the F# projects consistently have fewer dependencies than the classes in the C# projects.
Finally, in the F# projects, modules with a high number of dependencies are quite rare -- less than 2% overall. But in the C# projects, 10% of classes have ten or more dependencies on other classes.
It might be useful to look at the dependency diagrams now. These are SVG files, so you should be able to view them in your browser.
Note that most of these diagrams are very big -- so after you open them you will need to zoom out quite a bit in order to see anything!
Let's start by comparing the diagrams for SpecFlow and TickSpec.
Each diagram lists all the top-level types found in the project. If there is a dependency from one type to another, it is shown by an arrow. The dependencies point from left to right where possible, so any arrows going from right to left imply that there is a cyclic dependency.
The layout is done automatically by Graphviz, but in general, the types are organized into columns or "ranks". For example, the SpecFlow diagram has 12 ranks, and the TickSpec diagram has five.
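If you want to produce a similar diagram yourself, a DOT file is easy to generate from a dependency list. The Python sketch below is only a guess at the general shape; I am assuming the left-to-right layout comes from Graphviz's `rankdir=LR` graph attribute, and the function name and sample data are made up for illustration.

```python
def quote(name):
    """Wrap a type name in double quotes for DOT, escaping embedded quotes."""
    return '"' + name.replace('"', '\\"') + '"'


def to_dot(type_names, dependencies, graph_name="dependencies"):
    """Build DOT text with one node per top-level type and one edge per dependency."""
    lines = [f"digraph {graph_name} {{", "  rankdir=LR;"]  # lay ranks out left to right
    for name in sorted(type_names):
        lines.append(f"  {quote(name)};")  # list every type, even with no dependencies
    for src, dst in sorted(dependencies):
        lines.append(f"  {quote(src)} -> {quote(dst)};")
    lines.append("}")
    return "\n".join(lines)


# Illustrative data only.
dot_text = to_dot(["A", "B", "C"], {("A", "B"), ("B", "C")})
```

Saving the result to a file and running `dot -Tsvg deps.dot -o deps.svg` should then render it as SVG.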
As you can see, there are generally a lot of tangled lines in a typical dependency diagram! How tangled the diagram looks is a sort of visual measure of the code complexity. For instance, if I was tasked to maintain the SpecFlow project, I wouldn't really feel comfortable until I had understood all the relationships between the classes. And the more complex the project, the longer it takes to come up to speed.
The TickSpec diagram is a lot simpler than the SpecFlow one. Is that because TickSpec perhaps doesn't do as much as SpecFlow?
The answer is no, I don't think that it has anything to do with the size of the feature set at all, but rather because the code is organized differently.
Looking at the SpecFlow classes (dotfile), we can see it follows good OOD and TDD practices by creating interfaces. There's a TestRunnerManager and an ITestRunnerManager, for example. And there are many other patterns that commonly crop up in OOD: "listener" classes and interfaces, "provider" classes and interfaces, "comparer" classes and interfaces, and so on.
But if we look at the TickSpec diagram (dotfile) there are no interfaces at all. And no "listeners", "providers" or "comparers" either. There might well be a need for such things in the code, but either they are not exposed outside their module, or more likely, the role they play is fulfilled by functions rather than types.
I'm not picking on the SpecFlow code, by the way. It seems well designed, and is a very useful library, but I think it does highlight some of the differences between OO design and functional design.
Let's also compare the diagrams for YamlDotNet and FParsec.
The FParsec diagram is tiny. There's more code in FParsec than in YamlDotNet, but there are only 9 dependencies in FParsec (you can even count them by hand!) compared with 228 in YamlDotNet.
Again, YamlDotNet might do more than FParsec in some ways, and it might not be fair to compare a hand-crafted parser with a generic combinator parser. But even so, you can't help feeling that there is something about F# which reduces the complexity of a project.
Finally, we can turn our attention to the oh-so-evil cyclic dependencies. (If you want to know why they are bad, read this post.)
Here are the cyclic dependency results for the C# projects.
Project | Top-level types | Max comp. size | Cycle count | Max comp. size (public) | Cycle count (public) | Diagram |
---|---|---|---|---|---|---|
ef | 517 | 278 | 2 | 7 | 1 | svg; dotfile |
jsonDotNet | 215 | 83 | 3 | 11 | 1 | svg; dotfile |
nancy | 339 | 21 | 6 | 2 | 2 | svg; dotfile |
cecil | 240 | 123 | 2 | 50 | 1 | svg; dotfile |
nuget | 218 | 10 | 4 | 1 | 0 | svg; dotfile |
signalR | 193 | 7 | 3 | 5 | 1 | svg; dotfile |
nunit | 173 | 78 | 2 | 48 | 1 | svg; dotfile |
specFlow | 242 | 3 | 5 | 2 | 1 | svg; dotfile |
elmah | 116 | 5 | 2 | 2 | 1 | svg; dotfile |
yamlDotNet | 70 | 1 | 0 | 1 | 0 | |
And here are the results for the F# projects:
Project | Top-level types | Max comp. size | Cycle count | Max comp. size (public) | Cycle count (public) | Diagram |
---|---|---|---|---|---|---|
fsxCore | 173 | 1 | 0 | 1 | 0 | |
fsCore | 154 | 3 | 2 | 1 | 0 | svg; dotfile |
fsPowerPack | 93 | 2 | 1 | 1 | 0 | svg; dotfile |
storm | 67 | 1 | 0 | 1 | 0 | |
fsParsec | 8 | 1 | 0 | 1 | 0 | |
websharper | 52 | 1 | 0 | 0 | 0 | |
tickSpec | 34 | 1 | 0 | 1 | 0 | |
websharperHtml | 18 | 1 | 0 | 1 | 0 | |
canopy | 6 | 1 | 0 | 1 | 0 | |
fsYaml | 7 | 1 | 0 | 1 | 0 | |
fsSql | 13 | 1 | 0 | 1 | 0 | |
fsUnit | 2 | 0 | 0 | 0 | 0 | |
The columns are:
If we are looking for cycles in the F# code, we will be sorely disappointed. Only two of the F# projects have cycles at all, and those are tiny. For example in FSharp.Core there is a mutual dependency between two types right next to each other in the same file, here.
On the other hand, almost all the C# projects have one or more cycles. Entity Framework and Cecil are the worst offenders, with complex cycles involving half of the classes in the entire project! Surprisingly, NuGet has very few, which makes me think that perhaps someone is using a code analysis tool such as NDepend.
Why the difference between C# and F#?
I started this analysis from curiosity -- was there any meaningful difference in the organization of C# and F# projects?
I was quite surprised that the distinction was so clear. Given these metrics, you could certainly predict which language the assembly was written in.
I don't claim that this analysis is perfect (and I hope I haven't made a terrible mistake in the analysis code!) but I think that it could be a useful starting point for further investigation.