Processing 40 TB of code from 10M projects with a dedicated server and Go

Sloc Cloc and Code (scc), the command line tool I created (now modified and maintained by many other excellent people), counts lines of code and comments and makes a complexity estimate for files inside a directory. The latter is something you need a good sample size to make good use of. It works by counting branch statements in code. But what does that actually mean? For example, “This file has a complexity of 10” is not very useful without some context. To solve this I thought I would try running scc over all the source code I could get my hands on. This would also let me see if there were any edge cases I hadn’t considered in the tool itself. A brute-force QA trial by fire.

However, if I am going to run it over all that code, which is going to be computationally expensive, I may as well try to get some use out of it. So I decided to record everything as I went and see if I could get something interesting in the end, hence this post.

In short I downloaded and processed a lot of code using scc. The raw numbers include,

  • 9,985,051 total repositories
  • 9,100,083 repositories with at least 1 identified file
  • 884,968 empty repositories (those with no files)
  • 58,389,641 files in all repositories
  • 40,736,530,379,778 bytes processed (40 TB)
  • 1,086,723,618,560 lines identified
  • 816,822,273,469 code lines identified
  • 124,382,152,510 blank lines identified
  • 145,519,192,581 comment lines identified
  • 71,884,867,919 complexity count according to scc rules
  • 2 new bugs raised in scc

Let’s get the elephant out of the room first. It was not 10 million projects as the “click bait” title indicates. I was shy by 15,000 so I rounded up. Please forgive me.

It took about 5 weeks to download and run scc over the collection of repositories saving all of the data. It took just over 49 hours to crunch the 1 TB of JSON and produce the results below.

Methodology

Since I run searchcode.com I already have a collection of over 7,000,000 projects across git, mercurial, subversion and such. So why not try processing them? Working with git is usually the easiest solution so I ignored mercurial and subversion this time and exported the full list of git projects. Turns out I actually had 12 million git repositories being tracked, and I should probably update the page to reflect that.

So now I have 12 million or so git repositories which I need to download and process.

When you run scc you can choose to have it output the results as JSON, optionally saving them to a file on disk, like so:

    scc --format json --output myfile.json main.go

The results look like the following (done for a single file),

    [
      {
        "Blank": 115,
        "Bytes": 0,
        "Code": 423,
        "Comment": 30,
        "Complexity": 40,
        "Count": 1,
        "Files": [
          {
            "Binary": false,
            "Blank": 115,
            "Bytes": 20396,
            "Callback": null,
            "Code": 423,
            "Comment": 30,
            "Complexity": 40,
            "Content": null,
            "Extension": "go",
            "Filename": "main.go",
            "Hash": null,
            "Language": "Go",
            "Lines": 568,
            "Location": "main.go",
            "PossibleLanguages": [
              "Go"
            ],
            "WeightedComplexity": 0
          }
        ],
        "Lines": 568,
        "Name": "Go",
        "WeightedComplexity": 0
      }
    ]

As a larger example here are the results as JSON for the redis project, redis.json. All of the results below come from this output without any supporting additional data.
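For anyone wanting to consume this output programmatically, the following is a minimal sketch of decoding it in Go. The struct names (`LanguageSummary`, `FileJob`) and the helper `parseSummaries` are made up for this illustration and only cover a subset of the fields shown in the sample above; they are not taken from scc's own source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FileJob mirrors a subset of the per-file fields in the sample output.
// The type name is hypothetical; only the JSON keys come from the sample.
type FileJob struct {
	Language   string `json:"Language"`
	Location   string `json:"Location"`
	Lines      int64  `json:"Lines"`
	Code       int64  `json:"Code"`
	Comment    int64  `json:"Comment"`
	Blank      int64  `json:"Blank"`
	Complexity int64  `json:"Complexity"`
	Bytes      int64  `json:"Bytes"`
}

// LanguageSummary mirrors one element of the top-level JSON array.
type LanguageSummary struct {
	Name       string    `json:"Name"`
	Lines      int64     `json:"Lines"`
	Code       int64     `json:"Code"`
	Comment    int64     `json:"Comment"`
	Blank      int64     `json:"Blank"`
	Complexity int64     `json:"Complexity"`
	Count      int64     `json:"Count"`
	Files      []FileJob `json:"Files"`
}

// parseSummaries decodes one scc JSON document into typed summaries.
func parseSummaries(data []byte) ([]LanguageSummary, error) {
	var out []LanguageSummary
	if err := json.Unmarshal(data, &out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	// A trimmed copy of the single-file sample shown above.
	sample := []byte(`[{"Name":"Go","Lines":568,"Code":423,"Comment":30,"Blank":115,
		"Complexity":40,"Count":1,"Files":[{"Language":"Go","Location":"main.go",
		"Lines":568,"Code":423,"Comment":30,"Blank":115,"Complexity":40,"Bytes":20396}]}]`)

	summaries, err := parseSummaries(sample)
	if err != nil {
		panic(err)
	}
	for _, s := range summaries {
		fmt.Printf("%s: %d files, %d code lines, complexity %d\n",
			s.Name, s.Count, s.Code, s.Complexity)
	}
}
```

Fields that are not declared on the structs (such as `Callback` or `Hash`) are simply ignored by `encoding/json`, which keeps the sketch short.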

One thing to keep in mind is that scc generally categorises languages based on extension (except where an extension is shared, such as between Verilog and Coq). As such, if someone puts an HTML file with a java extension it will be counted as a Java file. Usually this isn’t a problem, because why would you ever do that? But of course at scale it is. It is something I discovered later, where some files were masquerading as another language.

A while back I wrote code to create GitHub badges using scc (https://boyter.org/posts/sloc-cloc-code-badges/), and since part of that included caching the results, I modified it slightly to cache the results as JSON in AWS S3.

With the badge code working in AWS using lambda, I took the exported list of projects, wrote about 15 lines of python to clean the format so it matched my lambda and make a request to it. I threw in some python multiprocessing to fork 32 processes to call the endpoint reasonably quickly.

This worked brilliantly. However there were two problems with it: firstly the cost, and secondly lambda behind API-Gateway/ALB has a 30 second timeout, so it couldn’t process large repositories fast enough. I knew going in that this was not going to be the most cost effective solution, but had it come in close to $100 I would have been willing to live with it. After processing 1 million repositories I checked and the cost was about $60. At that rate the full 12 million would have cost around $700, and since I didn’t want a $700 AWS bill I decided to rethink my solution. Keep in mind that was mostly storage and CPU, or what was needed to collect this information. Any processing or exporting of the data was going to increase the cost considerably.

Since I was already in AWS the hip solution would be to dump the URLs as messages into SQS and pull from it using EC2 instances or Fargate for processing. Then scale out like crazy. However, despite working in AWS in my day job, I have always believed in taco bell programming. Besides, it was only 12 million repositories, so I opted to implement a simpler (cheaper) solution.

Running this computation locally was out due to the abysmal state of the internet in Australia. However I do run searchcode.com fairly lean using dedicated servers from Hetzner. These boxes are quite powerful, i7 Quad Core 32 GB RAM machines often with 2 TB of disk space (usually unused). As such they usually have a lot of spare compute based on how I use them. The front-end varnish box for instance is doing the square root of zero most of the time. So why not run the processing there?

I didn’t quite taco bell program the solution using bash and GNU tools. What I did was write a simple Go program to spin up 32 goroutines which read from a channel and spawned git and scc subprocesses before writing the JSON output into S3. I actually wrote a Python solution at first, but having to install the pip dependencies on my clean varnish box seemed like a bad idea, and it kept breaking in odd ways which I didn’t feel like debugging.
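The shape of such a program can be sketched roughly as below. This is not the original code: the function names `processAll` and `cloneAndCount`, the `results` output directory and the `--depth 1` shallow clone are all assumptions made for the illustration, and the upload to S3 is left out entirely. Only the `scc --format json --output` flags come from the text above.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"sync"
)

// processAll fans the URLs out to `workers` goroutines over a channel and
// returns how many repositories failed. The work itself is passed in as a
// callback so the pool can be exercised without network access.
func processAll(urls []string, workers int, process func(url string) error) int {
	jobs := make(chan string)
	var wg sync.WaitGroup
	var mu sync.Mutex
	failed := 0

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for url := range jobs {
				if err := process(url); err != nil {
					mu.Lock()
					failed++
					mu.Unlock()
				}
			}
		}()
	}
	for _, u := range urls {
		jobs <- u
	}
	close(jobs) // lets the workers' range loops finish
	wg.Wait()
	return failed
}

// cloneAndCount clones one repository into a temporary directory, runs scc
// over it and leaves the JSON in outDir (which must be an absolute path,
// because scc runs with the clone as its working directory).
func cloneAndCount(url, outDir string) error {
	dir, err := os.MkdirTemp("", "repo-")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)

	// Assumption: a shallow clone is enough since only the current tree is counted.
	if err := exec.Command("git", "clone", "--depth", "1", url, dir).Run(); err != nil {
		return fmt.Errorf("clone %s: %w", url, err)
	}
	out := filepath.Join(outDir, filepath.Base(dir)+".json")
	cmd := exec.Command("scc", "--format", "json", "--output", out, ".")
	cmd.Dir = dir
	return cmd.Run()
}

func main() {
	urls := os.Args[1:]
	if len(urls) == 0 {
		fmt.Println("usage: pass one or more git URLs as arguments")
		return
	}
	outDir, err := filepath.Abs("results")
	if err != nil {
		panic(err)
	}
	if err := os.MkdirAll(outDir, 0o755); err != nil {
		panic(err)
	}
	failed := processAll(urls, 32, func(u string) error {
		return cloneAndCount(u, outDir)
	})
	fmt.Printf("%d repositories attempted, %d failed\n", len(urls), failed)
}
```

Because the workers spend most of their time waiting on the network and on the git and scc subprocesses, the number of goroutines can comfortably exceed the number of cores on the machine.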

Running this on the box produced the following sort of metrics in htop, and the multiple git/scc processes running (scc is not visible in this screen capture) suggested that everything was working as expected, which I confirmed by looking at the results in S3.

[Image: scc-data process load]

Presenting and Computing Results

Having recently read https://mattwarren.org/2017/10/12/Analysing-C-code-on-GitHub-with-BigQuery/ and https://psuter.net/2019/07/07/z-index I thought I would steal the format of those posts with regards to how I wanted to present the information. My twist on the previous however is to add jQuery DataTables over the large tables of information. This allows you to sort and search/filter results. As such you can click the headers to sort and use the search box to filter. The search box indicates that this is enabled for you. I also added a jump link near these tables so you can skip over them if you like.

The size of the data I needed to process raised another question. How does one process 10 million JSON files taking up just over 1 TB of disk space in an S3 bucket?

The first thought I had was AWS Athena. But since it’s going to cost something like $2.50 USD per query for that dataset I quickly looked for an alternative. That said if you kept the data there and processed it infrequently this might still work out to be the cheapest solution.

I posted the question on the company slack because why should I solve issues alone.

One idea raised was to dump the data into a large SQL database. However this means processing the data into the database, then running queries over it multiple times. Plus the structure of the data meant having a few tables which means foreign keys and indexes to ensure some level of performance. This feels wasteful because we could just process the data as we read it from disk in a single pass. I was also worried about building a database this large. With just data it would be over 1 TB in size before adding indexes.

Seeing as I produced the JSON using spare compute, I thought why not process the results the same way? Of course there is one issue with this. Pulling 1 TB of data out of S3 is going to cost a lot. In the event the program crashes that is going to be annoying. To reduce costs I wanted to pull all the files down locally and save them for further processing. Handy tip, you really do not want to store lots of little files on disk in a single directory. It sucks for runtime performance and file-systems don’t like it.

My answer to this was another simple Go program to pull the files down from S3 and store them in a tar file. I could then process that file over and over. The processing itself is done through a very ugly Go program that reads the tar file, so I could re-run my questions without having to trawl S3 over and over. I didn’t bother with goroutines for this code for two reasons. The first was that I didn’t want to max out my server, so this limits it to a single core for the hard CPU work (a second core, used to read the tar file, was mostly blocked waiting on the processing). The second was that I didn’t want to have to ensure it was thread-safe.
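To make the tar approach concrete, here is a small sketch of both halves: packing many small JSON documents into one archive, then streaming through the archive once and aggregating as it goes. The names `packTar`, `sumCode` and `demo` are invented for this example, and it works on an in-memory buffer; a real run would point the same functions at a file on disk.

```go
package main

import (
	"archive/tar"
	"bytes"
	"encoding/json"
	"fmt"
	"io"
)

// doc is one named JSON result, e.g. the scc output for a single repository.
type doc struct {
	name string
	body []byte
}

// packTar writes every document into a single tar archive, which avoids
// having millions of small files sitting in one directory.
func packTar(w io.Writer, docs []doc) error {
	tw := tar.NewWriter(w)
	for _, d := range docs {
		hdr := &tar.Header{
			Name:     d.name,
			Typeflag: tar.TypeReg,
			Mode:     0o644,
			Size:     int64(len(d.body)),
		}
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		if _, err := tw.Write(d.body); err != nil {
			return err
		}
	}
	return tw.Close()
}

// sumCode reads the archive sequentially in a single pass and totals the
// "Code" field of every language summary in every document.
func sumCode(r io.Reader) (int64, error) {
	tr := tar.NewReader(r)
	var total int64
	for {
		_, err := tr.Next()
		if err == io.EOF {
			return total, nil
		}
		if err != nil {
			return total, err
		}
		var langs []struct{ Code int64 }
		if err := json.NewDecoder(tr).Decode(&langs); err != nil {
			return total, err
		}
		for _, l := range langs {
			total += l.Code
		}
	}
}

// demo packs two tiny made-up documents and totals them again.
func demo() (int64, error) {
	var buf bytes.Buffer
	err := packTar(&buf, []doc{
		{"a.json", []byte(`[{"Code":423}]`)},
		{"b.json", []byte(`[{"Code":10},{"Code":5}]`)},
	})
	if err != nil {
		return 0, err
	}
	return sumCode(&buf)
}

func main() {
	total, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println("total code lines:", total)
}
```

Since the reader only ever moves forward through the archive, memory use stays flat no matter how large the tar file grows, which is what makes re-running new questions over the same file cheap.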

With that done, what I needed was a collection of questions to answer. I used the slack brains trust again and crowd-sourced my work colleagues while I came up with some ideas of my own. The result of this mind meld is included below.

You can find all the code I used to process the JSON, including the code that pulled it down locally and the ugly Python script I used to mangle it into something useful for this post, on GitHub (https://github.com/boyter/scc-data). Please don’t comment on it. I know the code is ugly, and it is something I wrote as a throwaway that I am unlikely to ever look at again.

If you do want to review code I have written to be read by others have a look at the source of scc.

Cost

I spent about $60 in compute while trialling lambda. I have not looked at the S3 storage cost yet but it should be close to $25 based on the size of the data. However this is not including the transfer costs which I also have not observed. Please note I cleared the bucket when I was finished with it so this is not an ongoing cost for me.

However, in the end I chose not to use AWS because of cost. So what’s the real cost assuming I wanted to do it again?

Well to start all the software used is free as in freedom and open source. So nothing to worry about there.

In my case the cost would be free as I used “free compute” left over from searchcode. Not everyone has compute lying around however. So let’s assume another person wishes to replicate this and as such needs to get a server.

It could be done for €73 using the cheapest new dedicated server from Hetzner (https://www.hetzner.com/dedicated-rootserver). However that cost includes a new server setup fee. If you are willing to wait and poke around on their auction house (https://www.hetzner.com/sb) you can find much cheaper servers with no setup fee at all. At time of writing I found the below machine, which would be perfect for this project and is €25.21 a month with no setup fee.

[Image: hetzner server]

Best part though? You can get the VAT removed if you are outside the EU. So give yourself an additional 10% discount on top if you are in this situation.

So were someone to do this from scratch using the same method I eventually went with it would cost under $100 USD to redo the same calculations, and more likely under $50 if you are a little patient or lucky. This also assumes you use the server for less than 2 months which is enough time to download and process. This also includes enough time for you to get a list of 10 million repositories to consider processing as well.

If I were to use a gzipped tar file in my analysis (which isn’t that hard to do really) I could even do 10x the repositories on the same machine as the resulting file would still be small enough to fit on the same hard disk. That would take longer to download though which is going to increase the cost for each additional month and this might take multiple months to do.

Going much larger than 100 million repositories however is going to require some level of sharding. Still it is safe to say that you could redo the entire process I did, or a larger one, on the same hardware without much effort or code changes.

Data Sources

From the three sources, github, bitbucket and gitlab how many projects came from each? Note that this is counted before excluding empty repositories hence the sum is over the number of repositories that actually form the counts below this point.

source | count
github | 9,680,111
bitbucket | 248,217
gitlab | 56,722

Sorry to the GitHub/Bitbucket/GitLab teams if you read this. If this caused any issues for you (I doubt it) I will shout you a refreshing beverage of your choice should we ever meet.

How many files in a repository?

On to the real questions. Let’s start with a simple one. How many files are in an average repository? Do most projects have a few files in them, or many? By looping over the repositories and counting the number of files we can then drop them in buckets of 1, 2, 10, 12 or however many files each has and plot it out.

[Image: scc-data files per project]

The X-axis in this case being buckets of the count of files, and the Y-axis being the count of projects with that many files. This is limited to projects with fewer than 1,000 files because the plot looks empty with a thin smear on the left side if you include all the outliers.
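The bucketing described here can be sketched in a few lines of Go. The function name `filesHistogram` and the sample numbers are made up for the illustration; the 1,000 file limit matches the cut-off used for the plot.

```go
package main

import (
	"fmt"
	"sort"
)

// filesHistogram maps "number of files" to "number of repositories with
// exactly that many files", skipping repositories at or above limit.
func filesHistogram(fileCounts []int, limit int) map[int]int {
	hist := make(map[int]int)
	for _, n := range fileCounts {
		if n < limit {
			hist[n]++
		}
	}
	return hist
}

func main() {
	// Made-up file counts for eight repositories.
	counts := []int{1, 2, 2, 10, 12, 12, 12, 4000}
	hist := filesHistogram(counts, 1000)

	// Maps are unordered in Go, so sort the buckets before printing.
	keys := make([]int, 0, len(hist))
	for k := range hist {
		keys = append(keys, k)
	}
	sort.Ints(keys)
	for _, k := range keys {
		fmt.Printf("%d files: %d repositories\n", k, hist[k])
	}
}
```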

As it turns out most repositories have less than 200 files in them.

However what about plotting this by percentile, or more specifically by 95th percentile, so it’s actually worth looking at? Turns out the vast majority (95%) of projects have fewer than 1,000 files in them, while 90% of them have fewer than 300 files and 85% have fewer than 200.
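For reference, a percentile over the per-repository file counts can be computed as in the sketch below. It uses the nearest-rank definition, which is one of several common definitions and is an assumption here; the post does not say which one was used. The function name `percentile` and the sample data are likewise illustrative.

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the smallest value such that at least p percent of the
// values are less than or equal to it (nearest-rank method, p in 1..100).
func percentile(values []int, p int) int {
	if len(values) == 0 {
		return 0
	}
	sorted := append([]int(nil), values...) // copy so the caller's slice is untouched
	sort.Ints(sorted)

	// Integer form of ceil(p/100 * n), avoiding floating point rounding.
	rank := (p*len(sorted) + 99) / 100
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

func main() {
	// Made-up file counts for a handful of repositories.
	counts := []int{3, 8, 15, 40, 120, 250, 900, 5000}
	for _, p := range []int{85, 90, 95} {
		fmt.Printf("p%d: %d files\n", p, percentile(counts, p))
	}
}
```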

[Image: scc-data files per project 95th]

If you want to plot this yourself and do a better job than I did, here is a link to the raw data: filesPerProject.json.

What’s the project breakdown per language?

This means that for each project scanned, if a Java file is identified the Java count is incremented by one, and for any further Java files in the same project nothing happens. This gives a quick view of which languages are most commonly used. Unsurprisingly the most common languages include markdown, .gitignore and plain text.
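A sketch of that counting rule, assuming each repository has already been reduced to the list of languages of its files (the input shape and the name `projectsPerLanguage` are assumptions for the example):

```go
package main

import "fmt"

// projectsPerLanguage counts, for each language, how many repositories
// contain at least one file in that language. Each repository is given as
// the list of languages of its files, so repeats are expected.
func projectsPerLanguage(repos [][]string) map[string]int {
	counts := make(map[string]int)
	for _, files := range repos {
		seen := make(map[string]bool) // languages already counted for this repository
		for _, lang := range files {
			if !seen[lang] {
				seen[lang] = true
				counts[lang]++
			}
		}
	}
	return counts
}

func main() {
	repos := [][]string{
		{"Java", "Java", "Markdown"}, // two Java files still count once
		{"Markdown"},
		{}, // an empty repository contributes nothing
	}
	counts := projectsPerLanguage(repos)
	fmt.Println("Java:", counts["Java"], "Markdown:", counts["Markdown"])
}
```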

Markdown, the most commonly used language in any project, is included in just over 6 million projects, which is about 2/3 of the entire project set. This makes sense since almost all projects include a README.md which is displayed as HTML on repository pages.

The full list is included below.

skip table to next section

language | project count
Markdown | 6,041,849
gitignore | 5,471,254
Plain Text | 3,553,325
JavaScript | 3,408,921
HTML | 3,397,596
CSS | 3,037,754
License | 2,597,330
XML | 2,218,846
JSON | 1,903,569
YAML | 1,860,523
Python | 1,424,505
Shell | 1,395,199
Ruby | 1,386,599
Java | 1,319,091
C Header | 1,259,519
Makefile | 1,215,586
Rakefile | 1,006,022
PHP | 992,617
Properties File | 909,631
SVG | 804,946
C | 791,773
C++ | 715,269
Batch | 645,442
Sass | 535,341
Autoconf | 505,347
Objective C | 503,932
CoffeeScript | 435,133
SQL | 413,739
Perl | 390,775
C# | 380,841
ReStructuredText | 356,922
MSBuild | 354,212
LESS | 281,286
CSV | 275,143
C++ Header | 199,245
CMake | 173,482
Patch | 169,078
Assembly | 165,587
XML Schema | 148,511
m4 | 147,204
JavaServer Pages | 142,605
Vim Script | 134,156
Scala | 132,454
Objective C++ | 127,797
Gradle | 126,899
Module-Definition | 120,181
Bazel | 114,842
R | 113,770
ASP.NET | 111,431
Go Template | 111,263
Document Type Definition | 109,710
Gherkin Specification | 107,187
Smarty Template | 106,668
Jade | 105,903
Happy | 105,631
Emacs Lisp | 105,620
Prolog | 102,792
Go | 99,093
Lua | 98,232
BASH | 95,931
D | 94,400
ActionScript | 93,066
TeX | 84,841
Powershell | 80,347
AWK | 79,870
Groovy | 75,796
LEX | 75,335
nuspec | 72,478
sed | 70,454
Puppet | 67,732
Org | 67,703
Clojure | 67,145
XAML | 65,135
TypeScript | 62,556
Systemd | 58,197
Haskell | 58,162
XCode Config | 57,173
Boo | 55,318
LaTeX | 55,093
Zsh | 55,044
Stylus | 54,412
Razor | 54,102
Handlebars | 51,893
Erlang | 49,475
HEX | 46,442
Protocol Buffers | 45,254
Mustache | 44,633
ASP | 43,114
Extensible Stylesheet Language Transformations | 42,664
Twig Template | 42,273
Processing | 41,277
Dockerfile | 39,664
Swig | 37,539
LD Script | 36,307
FORTRAN Legacy | 35,889
Scons | 35,373
Scheme | 34,982
Alex | 34,221
TCL | 33,766
Android Interface Definition Language | 33,000
Ruby HTML | 32,645
Device Tree | 31,918
Expect | 30,249
Cabal | 30,109
Unreal Script | 29,113
Pascal | 28,439
GLSL | 28,417
Intel HEX | 27,504
Alloy | 27,142
Freemarker Template | 26,456
IDL | 26,079
Visual Basic for Applications | 26,061
Macromedia eXtensible Markup Language | 24,949
F# | 24,373
Cython | 23,858
Jupyter | 23,577
Forth | 22,108
Visual Basic | 21,909
Lisp | 21,242
OCaml | 20,216
Rust | 19,286
Fish | 18,079
Monkey C | 17,753
Ada | 17,253
SAS | 17,031
Dart | 16,447
TypeScript Typings | 16,263
SystemVerilog | 15,541
Thrift | 15,390
C Shell | 14,904
Fragment Shader File | 14,572
Vertex Shader File | 14,312
QML | 13,709
ColdFusion | 13,441
Elixir | 12,716
Haxe | 12,404
Jinja | 12,274
JSX | 12,194
Specman e | 12,071
FORTRAN Modern | 11,460
PKGBUILD | 11,398
ignore | 11,287
Mako | 10,846
TOML | 10,444
SKILL | 10,048
AsciiDoc | 9,868
Swift | 9,679
BuildStream | 9,198
ColdFusion CFScript | 8,614
Stata | 8,296
Creole | 8,030
Basic | 7,751
V | 7,560
VHDL | 7,368
Julia | 7,070
ClojureScript | 7,018
Closure Template | 6,269
AutoHotKey | 5,938
Wolfram | 5,764
Docker ignore | 5,555
Korn Shell | 5,541
Arvo | 5,364
Coq | 5,068
SRecode Template | 5,019
Game Maker Language | 4,557
Nix | 4,216
Vala | 4,110
COBOL | 3,946
Varnish Configuration | 3,882
Kotlin | 3,683
Bitbake | 3,645
GDScript | 3,189
Standard ML (SML) | 3,143
Jenkins Buildfile | 2,822
Xtend | 2,791
ABAP | 2,381
Modula3 | 2,376
Nim | 2,273
Verilog | 2,013
Elm | 1,849
Brainfuck | 1,794
Ur/Web | 1,741
Opalang | 1,367
GN | 1,342
TaskPaper | 1,330
Ceylon | 1,265
Crystal | 1,259
Agda | 1,182
Vue | 1,139
LOLCODE | 1,101
Hamlet | 1,071
Robot Framework | 1,062
MUMPS | 940
Emacs Dev Env | 937
Cargo Lock | 905
Flow9 | 839
Idris | 804
Julius | 765
Oz | 764
Q# | 695
Lucius | 627
Meson | 617
F* | 614
ATS | 492
PSL Assertion | 483
Bitbucket Pipeline | 418
PureScript | 370
Report Definition Language | 313
Isabelle | 296
JAI | 286
MQL4 | 271
Ur/Web Project | 261
Alchemist | 250
Cassius | 213
Softbridge Basic | 207
MQL Header | 167
JSONL | 146
Lean | 104
Spice Netlist | 100
Madlang | 97
Luna | 91
Pony | 86
MQL5 | 46
Wren | 33
Just | 30
QCL | 27
Zig | 21
SPDX | 20
Futhark | 16
Dhall | 15
FIDL | 14
Bosque | 14
Janet | 13
Game Maker Project | 6
Polly | 6
Verilog Args File | 2

How many files in a repository per language?

An extension of the above, but averaged over however many files are in each language per repository. So for projects that contain java, how many java files exist in that project, and on average for all projects how many files is that?

You can use this to see if a project is larger or smaller than usual for your language of choice.
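One straightforward way to compute such a figure is a plain arithmetic mean over the repositories that contain the language, sketched below. This is an assumption about the calculation rather than a description of the script that produced the table, which may well compute its average differently. The name `averageFilesPerLanguage` and the sample data are made up.

```go
package main

import "fmt"

// averageFilesPerLanguage takes one map per repository (language -> number
// of files in that language) and returns, for each language, the mean file
// count across only those repositories that contain the language.
func averageFilesPerLanguage(repos []map[string]int) map[string]float64 {
	totalFiles := make(map[string]int)
	repoCount := make(map[string]int)
	for _, repo := range repos {
		for lang, n := range repo {
			totalFiles[lang] += n
			repoCount[lang]++
		}
	}
	avg := make(map[string]float64, len(totalFiles))
	for lang, total := range totalFiles {
		avg[lang] = float64(total) / float64(repoCount[lang])
	}
	return avg
}

func main() {
	repos := []map[string]int{
		{"Java": 10, "Markdown": 1},
		{"Java": 20},
		{"Markdown": 3},
	}
	avg := averageFilesPerLanguage(repos)
	fmt.Printf("Java: %.1f files per project, Markdown: %.1f\n", avg["Java"], avg["Markdown"])
}
```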

skip table to next section

language | average file count
ABAP | 1.0008927583699165
ASP | 1.6565139917314107
ASP.NET | 346.88867258489296
ATS | 7.888545610390882
AWK | 5.098807478952136
ActionScript | 15.682562363539644
Ada | 7.265376817272021
Agda | 1.2669381110755398
Alchemist | 7.437307493090622
Alex | 20.152479318023637
Alloy | 1.0000000894069672
Android Interface Definition Language | 3.1133707938643074
Arvo | 9.872687772928423
AsciiDoc | 14.645389421879814
Assembly | 1049.6270518312476
AutoHotKey | 1.5361384288472488
Autoconf | 33.99728695464163
BASH | 3.7384110335355545
Basic | 5.103623499110781
Batch | 3.943513588378872
Bazel | 1.0013122734382187
Bitbake | 1.0878349272366024
Bitbucket Pipeline | 1
Boo | 5.321822367969364
Bosque | 1.28173828125
Brainfuck | 1.3141119785974242
BuildStream | 1.4704635441667189
C | 15610.17972307699
C Header | 14103.33936083782
C Shell | 3.1231084093649315
C# | 45.804460355773394
C++ | 30.416980313492328
C++ Header | 8.313450764990089
CMake | 37.2566873554469
COBOL | 3.129408853490878
CSS | 5.332398714337156
CSV | 8.370432089241898
Cabal | 1.0078125149013983
Cargo Lock | 1.0026407549221519
Cassius | 4.657169356495984
Ceylon | 7.397692655679642
Clojure | 8.702303821528872
ClojureScript | 5.384518778099244
Closure Template | 1.0210028022356945
CoffeeScript | 45.40906609668401
ColdFusion | 13.611857060674573
ColdFusion CFScript | 40.42554202020521
Coq | 10.903652047164622
Creole | 1.000122070313864
Crystal | 3.8729367926098117
Cython | 1.9811811237515262
D | 529.2562750397005
Dart | 1.5259554297822313
Device Tree | 586.4119588123021
Dhall | 5.072265625
Docker ignore | 1.0058596283197403
Dockerfile | 1.7570825852789156
Document Type Definition | 2.2977520758534693
Elixir | 8.916658446524252
Elm | 1.6702759813968946
Emacs Dev Env | 15.720268315288969
Emacs Lisp | 11.378847912292201
Erlang | 3.4764894379621607
Expect | 2.8863991651091614
Extensible Stylesheet Language Transformations | 1.2042068607534995
F# | 1.2856606249320954
F* | 32.784058919015
FIDL | 1.8441162109375
FORTRAN Legacy | 11.37801716560221
FORTRAN Modern | 27.408192558594685
Fish | 1.1282354207617833
Flow9 | 5.218046229973186
Forth | 10.64736177807574
Fragment Shader File | 3.648087980622546
Freemarker Template | 8.397226930409037
Futhark | 4.671875
GDScript | 3.6984173692608313
GLSL | 1.6749061330076334
GN | 1.0193083210608163
Game Maker Language | 3.6370866431049604
Game Maker Project | 1.625
Gherkin Specification | 60.430588516231666
Go | 115.23482489228113
Go Template | 28.011342078505013
Gradle | 5.628880473160033
Groovy | 6.697367294187844
HEX | 22.477003537989486
HTML | 4.822243456786672
Hamlet | 50.297887645777536
Handlebars | 36.60120978679127
Happy | 5.820573911044464
Haskell | 8.730027121836951
Haxe | 20.00590981880653
IDL | 79.38510300176867
Idris | 1.524684997890027
Intel HEX | 113.25178379632708
Isabelle | 1.8903018088753136
JAI | 1.4865150753259275
JSON | 6.507823973898348
JSONL | 1.003931049286678
JSX | 4.6359645801363465
Jade | 5.353279289700571
Janet | 1.0390625
Java | 118.86142228014006
JavaScript | 140.56079100796154
JavaServer Pages | 2.390251418283771
Jenkins Buildfile | 1.0000000000582077
Jinja | 4.574843152310338
Julia | 6.672268339671913
Julius | 2.2510109380818903
Jupyter | 13.480476117239338
Just | 1.736882857978344
Korn Shell | 1.5100887455636172
Kotlin | 3.9004723322169363
LD Script | 16.59996086864524
LESS | 39.6484785300563
LEX | 5.892075421476933
LOLCODE | 1.0381496530137617
LaTeX | 5.336103768010524
Lean | 1.6653789470747329
License | 5.593879701111845
Lisp | 33.15947937896521
Lua | 24.796117625764612
Lucius | 6.5742989471450155
Luna | 4.437807061133055
MQL Header | 13.515527575704464
MQL4 | 6.400151428436254
MQL5 | 46.489316522221515
MSBuild | 4.8321384193507875
MUMPS | 8.187699062741014
Macromedia eXtensible Markup Language | 2.1945287114300807
Madlang | 3.7857666909751373
Makefile | 1518.1769808494607
Mako | 3.410234685769436
Markdown | 45.687500000234245
Meson | 32.45071679724949
Modula3 | 1.1610784588847318
Module-Definition | 4.9327688042002595
Monkey C | 3.035163164383345
Mustache | 19.052714578803542
Nim | 1.202213335585401
Nix | 2.7291879559930488
OCaml | 3.7135029841909697
Objective C | 4.9795510788040005
Objective C++ | 2.2285232767506264
Opalang | 1.9975597794742732
Org | 5.258117805392903
Oz | 22.250069644336204
PHP | 199.17870638869982
PKGBUILD | 7.50632295051949
PSL Assertion | 3.0736406530442473
Pascal | 90.55238627885495
Patch | 25.331829692384225
Perl | 27.46770444081142
Plain Text | 1119.2375825397799
Polly | 1
Pony | 3.173291031071342
Powershell | 6.629884642978543
Processing | 9.139907354078636
Prolog | 1.816763080890156
Properties File | 2.1801967863634255
Protocol Buffers | 2.0456253005879304
Puppet | 43.424491631161054
PureScript | 4.063801504037935
Python | 22.473917606983292
Q# | 5.712939431518483
QCL | 7.590678825974464
QML | 1.255201818986247
R | 2.3781868952970115
Rakefile | 14.856192677576413
Razor | 62.79058974450959
ReStructuredText | 11.63852408056825
Report Definition Language | 23.065085061465403
Robot Framework | 2.6260137148703535
Ruby | 554.0134362337432
Ruby HTML | 24.091116656979562
Rust | 2.3002003813895207
SAS | 1.0032075758254648
SKILL | 1.9229039972029645
SPDX | 2.457843780517578
SQL | 2.293643752864969
SRecode Template | 20.688193360975845
SVG | 4.616972531365432
Sass | 42.92418345584642
Scala | 1.5957851387966393
Scheme | 10.480490204644848
Scons | 2.1977062552968114
Shell | 41.88552208947577
Smarty Template | 6.90615223295527
Softbridge Basic | 22.218602385698304
Specman e | 2.719783829645327
Spice Netlist | 2.454830619852739
Standard ML (SML) | 3.7598713626650295
Stata | 2.832579915520368
Stylus | 7.903926412469745
Swift | 54.175594149331914
Swig | 2.3953681161240747
SystemVerilog | 7.120705494624247
Systemd | 80.83254275520476
TCL | 46.9378307136513
TOML | 1.0316491217260413
TaskPaper | 1.0005036006351133
TeX | 8.690789447558961
Thrift | 1.620168483240211
Twig Template | 18.33051814392764
TypeScript | 1.2610517452930048
TypeScript Typings | 2.3638072576034137
Unreal Script | 2.9971615019965148
Ur/Web | 3.420488425604595
Ur/Web Project | 1.8752869585517504
V | 1.8780624768245784
VHDL | 5.764059992075602
Vala | 42.22072166146626
Varnish Configuration | 1.9899734258599446
Verilog | 1.443359777332832
Verilog Args File | 25.5
Vertex Shader File | 2.4700152399875077
Vim Script | 3.2196359799822662
Visual Basic | 119.8397831247842
Visual Basic for Applications | 2.5806381264096503
Vue | 249.0557418123258
Wolfram | 1.462178856761796
Wren | 227.4526259500999
XAML | 2.6149608174399264
XCode Config | 6.979387911493798
XML | 146.10128153519918
XML Schema | 6.832042266604565
Xtend | 2.87054940757827
YAML | 6.170148717655746
Zig | 1.071681022644043
Zsh | 2.6295064863912088
gitignore | 6.878908416722053
ignore | 1.0210649380633772
m4 | 57.5969985568356
nuspec | 3.245791111381787
sed | 1.3985770380241234

How many lines of code are in a typical file per language?

I suppose you could also look at this as which languages on average have the largest files. Using the average/mean for this pushes the results out to stupidly high numbers. This is because files such as sqlite.c, which is included in many projects, are joined together from many files into one, but nobody ever works on that single large file (I hope!).

So I calculated this using the median value. Even so there are still some definitions with stupidly high numbers such as Bosque and JavaScript.

So I figured why not have both? I did one small change based on the suggestion of Darrell (Kablamo’s resident and most excellent data scientist) and modified the average value to ignore files over 5000 lines to remove the outliers.
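The two statistics can be sketched as follows. The function names are made up, and whether a file of exactly 5,000 lines is kept or dropped is an assumption in this sketch (it is kept).

```go
package main

import (
	"fmt"
	"sort"
)

// meanIgnoringLarge averages the line counts while skipping any file with
// more than cutoff lines, so a few huge generated files cannot dominate.
func meanIgnoringLarge(lines []int, cutoff int) float64 {
	sum, n := 0, 0
	for _, l := range lines {
		if l <= cutoff { // assumption: a file of exactly cutoff lines is kept
			sum += l
			n++
		}
	}
	if n == 0 {
		return 0
	}
	return float64(sum) / float64(n)
}

// median returns the middle value of the line counts, or the mean of the
// two middle values when there is an even number of files.
func median(lines []int) float64 {
	if len(lines) == 0 {
		return 0
	}
	sorted := append([]int(nil), lines...) // copy so the input is not reordered
	sort.Ints(sorted)
	mid := len(sorted) / 2
	if len(sorted)%2 == 1 {
		return float64(sorted[mid])
	}
	return float64(sorted[mid-1]+sorted[mid]) / 2
}

func main() {
	// Three ordinary files and one huge amalgamated file (made-up numbers).
	lines := []int{10, 20, 30, 100000}
	fmt.Println("mean ignoring files over 5000 lines:", meanIgnoringLarge(lines, 5000))
	fmt.Println("median:", median(lines))
}
```

With this sample the untrimmed mean would be a little over 25,000 lines, which shows how a single amalgamated file can distort the figure for a whole language.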

skip table to next section

languagemean < 5000median
ABAP13936
ASP513170
ASP.NET315148
ATS9451,411
AWK431774
ActionScript9502,676
Ada1,17913
Agda46689
Alchemist1,0401,463
Alex479204
Alloy7266
Android Interface Definition Language119190
Arvo2571,508
AsciiDoc5191,724
Assembly993225
AutoHotKey36023
Autoconf495144
BASH42526
Basic476847
Batch178208
Bazel22620
Bitbake43610
Bitbucket Pipeline1913
Boo898924
Bosque58199,238
Brainfuck141177
BuildStream1,9552,384
C1,0525,774
C Header869126,460
C Shell12877
C#1,2151,138
C++1,166232
C++ Header838125
CMake75015
COBOL42224
CSS729103
CSV41112
Cabal11613
Cargo Lock814686
Cassius124634
Ceylon20715
Clojure52119
ClojureScript504195
Closure Template34375
CoffeeScript342168
ColdFusion6865
ColdFusion CFScript1,2311,829
Coq56029,250
Creole8520
Crystal973119
Cython8531,738
D39710
Dart583500
Device Tree73944,002
Dhall12499
Docker ignore102
Dockerfile7617
Document Type Definition5221,202
Elixir402192
Elm438121
Emacs Dev Env646755
Emacs Lisp65315
Erlang930203
Expect419195
Extensible Stylesheet Language Transformations442600
F#38464
F*33565
FIDL6551,502
FORTRAN Legacy2771,925
FORTRAN Modern636244
Fish16874
Flow936832
Forth25662
Fragment Shader File30911
Freemarker Template52220
Futhark175257
GDScript4011
GLSL38029
GN9508,866
Game Maker Language710516
Game Maker Project1,290374
Gherkin Specification5162,386
Go780558
Go Template41125,342
Gradle22822
Groovy73413
HEX1,00217,208
HTML5561,814
Hamlet22070
Handlebars5063,162
Happy1,6170
Haskell65617
Haxe8659,607
IDL386210
Idris28542
Intel HEX1,256106,650
Isabelle7921,736
JAI26841
JSON28939
JSONL432
JSX39324
Jade299192
Janet50832
Java1,165697
JavaScript89473,979
JavaServer Pages644924
Jenkins Buildfile796
Jinja4653,914
Julia5391,031
Julius11312
Jupyter1,361688
Just6272
Korn Shell427776
Kotlin554169
LD Script521439
LESS1,08617
LEX1,014214
LOLCODE1294
LaTeX8957,482
Lean1819
License26620
Lisp7461,201
Lua820559
Lucius284445
Luna8548
MQL Header79310,337
MQL47993,168
MQL5384631
MSBuild558160
MUMPS92498,191
Macromedia eXtensible Markup Language50020
Madlang368340
Makefile30920
Mako269243
Markdown20610
Meson546205
Modula316217
Module-Definition4897
Monkey C14028
Mustache2988,083
Nim3523
Nix24078
OCaml71868
Objective C1,11117,103
Objective C++903244
Opalang15129
Org52324
Oz3607,132
PHP96414,660
PKGBUILD13119
PSL Assertion149108
Pascal1,044497
Patch67612
Perl76211
Plain Text352841
Polly1226
Pony33842,488
Powershell652199
Processing800903
Prolog2826
Properties File18418
Protocol Buffers5768,080
Puppet499660
PureScript598363
Python879258
Q#4755,417
QCL5483
QML8156,067
R56620
Rakefile1227
Razor7131,842
ReStructuredText7355,049
Report Definition Language1,38934,337
Robot Framework292115
Ruby7394,942
Ruby HTML326192
Rust1,0074
SAS23365
SKILL526123
SPDX1,242379
SQL466143
SRecode Template796534
SVG7961,538
Sass68214,653
Scala612661
Scheme5666
Scons5456,042
Shell3044
Smarty Template39215
Softbridge Basic2,0673
Specman e1270
Spice Netlist9061,465
Standard ML (SML)47875
Stata20012
Stylus505214
Swift683663
Swig1,0314,540
SystemVerilog563830
Systemd12726
TCL77442,396
TOML10017
TaskPaper377
TeX804905
Thrift545329
Twig Template7139,907
TypeScript46110
TypeScript Typings1,465236,866
Unreal Script795927
Ur/Web429848
Ur/Web Project3326
V7045,711
VHDL9521,452
Vala6032
Varnish Configuration20377
Verilog1982
Verilog Args File456481
Vertex Shader File16874
Vim Script55525
Visual Basic7381,050
Visual Basic for Applications979936
Vue732242
Wolfram940973
Wren358279,258
XAML70324
XCode Config20011
XML6051,033
XML Schema1,008248
Xtend710120
YAML16547,327
Zig188724
Zsh3009
gitignore333
ignore62
m4959807
nuspec187193
sed8233

Average complexity for file in each language?

What’s the average complexity per file for each language?

The complexity estimate isn’t really directly comparable between languages. Pulling from the README of scc itself

The complexity estimate is really just a number that is only comparable to files in the same language. It should not be used to compare languages directly without weighting them. The reason for this is that its calculated by looking for branch and loop statements in the code and incrementing a counter for that file.

So comparing languages to each other is not the idea here, although it may be comparable between similar languages such as Java and C for example. Your mileage may vary.

This is more useful when you think about it applying to single files in the same language. So you could answer the question “On average is the file I am working with more or less complex than average?”.

I should mention that I am always looking to improve this calculation and looking for submissions to scc to assist with this goal. Usually it is a case of just adding some keywords to the languages.json file so any programmer of any skill level should be able to assist with this.
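To give a feel for the idea, here is a deliberately naive sketch of a keyword-based complexity counter. It is not scc's implementation: the token list is a hypothetical stand-in for whatever a language's entry in languages.json contains, and unlike a real counter it makes no attempt to skip keywords that appear inside strings or comments.

```go
package main

import (
	"fmt"
	"strings"
)

// estimateComplexity counts how often any of the given branch or loop
// tokens appears in the source text. Every occurrence adds one to the total.
func estimateComplexity(src string, tokens []string) int {
	total := 0
	for _, t := range tokens {
		total += strings.Count(src, t)
	}
	return total
}

func main() {
	// Hypothetical token list for a C-like language; the trailing spaces
	// avoid matching identifiers such as "format" or "iffy".
	tokens := []string{"if ", "for ", "while ", "switch ", "case ", "&& ", "|| "}

	src := `
for i := 0; i < n; i++ {
	if a[i] > 0 && b[i] > 0 {
		total += a[i]
	}
}`
	fmt.Println("complexity estimate:", estimateComplexity(src, tokens))
}
```

Because the count depends entirely on which tokens are listed for a language, two languages with different token lists produce numbers on different scales, which is why the figure is only comparable between files of the same language.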

skip table to next section

language | complexity
ABAP | 11.180740488380376
ASP | 11.536947250366211
ASP.NET | 2.149275320643484
ATS | 0.7621728432717677
AWK | 0
ActionScript | 22.088579905848178
Ada | 13.69141626294931
Agda | 0.19536590785719454
Alchemist | 0.3423442907696928
Alex | 0
Alloy | 6.9999997997656465
Android Interface Definition Language | 0
Arvo | 0
AsciiDoc | 0
Assembly | 1.5605608227976997
AutoHotKey | 423.87785756399626
Autoconf | 1.5524294972419739
BASH | 7.500000094871363
Basic | 1.0001350622574257
Batch | 1.4136352496767306
Bazel | 6.523681727119303
Bitbake | 0.00391388021490391
Bitbucket Pipeline | 0
Boo | 65.67764583729533
Bosque | 236.79837036132812
Brainfuck | 27.5516445041791
BuildStream | 0
C | 220.17236548200242
C Header | 0.027589923237434522
C Shell | 1.4911166269191476
C# | 1.0994400597744005
C++ | 215.23628287682845
C++ Header | 2.2893104921677154
CMake | 0.887660006199008
COBOL | 0.018726348891789816
CSS | 6.317460331175305E-176
CSV | 0
Cabal | 3.6547924155738194
Cargo Lock | 0
Cassius | 0
Ceylon | 21.664400369259404
Clojure | 0.00009155273437716484
ClojureScript | 0.5347588658332859
Closure Template | 0.503426091716392
CoffeeScript | 0.02021490140137264
ColdFusion | 6.851776515250336
ColdFusion CFScript | 22.287403080299764
Coq | 3.3282556015266307
Creole | 0
Crystal | 1.6065794006138856
Cython | 42.87412906489837
D | 0
Dart | 2.1264450684815657
Device Tree | 0
Dhall | 0
Docker ignore | 0
Dockerfile | 6.158891172385556
Document Type Definition | 0
Elixir | 0.5000612735793482
Elm | 5.237952479502043
Emacs Dev Env | 1.2701271416728307E-61
Emacs Lisp | 0.19531250990197657
Erlang | 0.08028322620528387
Expect | 0.329944610851471
Extensible Stylesheet Language Transformations | 0
F# | 0.32300702900710193
F* | 9.403954876643223E-38
FIDL | 0.12695312593132269
FORTRAN Legacy | 0.8337643985574195
FORTRAN Modern | 7.5833590276411185
Fish | 1.3386242155247368
Flow9 | 34.5
Forth | 2.4664166555765066
Fragment Shader File | 0.0003388836600090293
Freemarker Template | 10.511094652522283
Futhark | 0.8057891242233386
GDScript | 10.750000000022537
GLSL | 0.6383056697891334
GN | 22.400601854287807
Game Maker Language | 4.709514207365569
Game Maker Project | 0
Gherkin Specification | 0.4085178437480328
Go | 50.06279203974034
Go Template | 2.3866690339840662E-153
Gradle | 0
Groovy | 3.2506868488244898
HEX | 0
HTML | 0
Hamlet | 0.25053861103978114
Handlebars | 1.6943764911351036E-21
Happy | 0
Haskell | 28.470107150053625
Haxe | 66.52873523714804
IDL | 7.450580598712868E-9
Idris | 17.77642903881352
Intel HEX | 0
Isabelle | 0.0014658546850726184
JAI | 7.749968137734008
JSON | 0
JSONL | 0
JSX | 0.3910405338329044
Jade | 0.6881713929215119
Janet | 0
Java | 297.22908150612085
JavaScript | 1.861130583340945
JavaServer Pages | 7.24235416213196
Jenkins Buildfile | 0
Jinja | 0.6118526458846931
Julia | 5.779676990326951
Julius | 3.7432448068125277
Jupyter | 0
Just | 1.625490248219907
Korn Shell | 11.085027896435056
Kotlin | 5.467347841779503
LD Script | 6.538079182471746E-26
LESS | 0
LEX | 0
LOLCODE | 5.980839657708373
LaTeX | 0
Lean | 0.0019872561135834133
License | 0
Lisp | 4.033602018074421
Lua | 44.70686769972825
Lucius | 0
Luna | 0
MQL Header | 82.8036524637758
MQL4 | 2.9989408299408566
MQL5 | 32.84198718928553
MSBuild | 2.9802322387695312E-8
MUMPS | 5.767955578948634E-17
Macromedia eXtensible Markup Language | 0
Madlang | 8.25
Makefile | 3.9272747722381812E-90
Mako | 0.007624773579836673
Markdown | 0
Meson | 0.3975182396400463
Modula3 | 0.7517121883916386
Module-Definition | 0.25000000023283153
Monkey C | 9.838715311259486
Mustache | 0.00004191328599945435
Nim | 0.04812580073302998
Nix | 25.500204694250044
OCaml | 16.92218069843716
Objective C | 65.08967337175548
Objective C++ | 10.886891531550603
Opalang | 1.3724696160763994E-8
Org | 28.947825231747235
Oz | 6.260657086070324
PHP | 2.8314653639690874
PKGBUILD | 0
PSL Assertion | 0.5009768009185791
Pascal | 4
Patch | 0
Perl | 48.16959255514553
Plain Text | 0
Polly | 0
Pony | 4.91082763671875
Powershell | 0.43151378893449877
Processing | 9.691001653621564
Prolog | 0.5029296875147224
Properties File | 0
Protocol Buffers | 0.07128906529847256
Puppet | 0.16606500436341776
PureScript | 1.3008141816356456
Python | 11.510142201304832
Q# | 5.222080192729404
QCL | 13.195626304795667
QML | 0.3208023407643109
R | 0.40128818821921775
Rakefile | 2.75786388297917
Razor | 0.5298294073055322
ReStructuredText | 0
Report Definition Language | 0
Robot Framework | 0
Ruby | 7.8611656283491795
Ruby HTML | 1.3175727506823756
Rust | 8.62646485221385
SAS | 0.5223999023437882
SKILL | 0.4404907226562501
SPDX | 0
SQL | 0.00001537799835205078
SRecode Template | 0.18119949102401853
SVG | 1.7686873200833423E-74
Sass | 7.002974651049148E-113
Scala | 17.522343645163424
Scheme | 0.00003147125255509322
Scons | 25.56868253610655
Shell | 6.409446969197895
Smarty Template | 53.06143077491294
Softbridge Basic | 7.5
Specman e | 0.0639350358484781
Spice Netlist | 1.3684555315672042E-48
Standard ML (SML) | 24.686901116754818
Stata | 1.5115316917094068
Stylus | 0.3750006556512421
Swift | 0.5793484510104517
Swig | 0
SystemVerilog | 0.250593163372906
Systemd | 0
TCL | 96.5072605676113
TOML | 0.0048828125000002776
TaskPaper | 0
TeX | 54.0588040258797
Thrift | 0
Twig Template | 2.668124511961211
TypeScript | 9.191392608918255
TypeScript Typings | 6.1642456222327375
Unreal Script | 2.7333421227943004
Ur/Web | 16.51621568240534
Ur/Web Project | 0
V | 22.50230618938804
VHDL | 18.05495198571289
Vala | 147.2761703068509
Varnish Configuration | 0
Verilog | 5.582400367711671
Verilog Args File | 0
Vertex Shader File | 0.0010757446297590262
Vim Script | 2.4234658314493798
Visual Basic | 0.0004882812500167852
Visual Basic for Applications | 4.761343429454877
Vue | 0.7529517744621779
Wolfram | 0.0059204399585724215
Wren | 0.08593750013097715
XAML | 6.984919309616089E-10
XCode Config | 0
XML | 0
XML Schema | 0
Xtend | 2.8245844719990547
YAML | 0
Zig | 1.0158334437942358
Zsh | 1.81697392626756
gitignore | 0
ignore | 0
m4 | 0
nuspec | 0
sed | 22.91158285739948

What’s the average amount of comments used per file in each language?

You could probably rephrase this as asking which developers write the most comments, assuming you squint enough.

language | comments
ABAP | 56.3020026683825
ASP | 24.67145299911499
ASP.NET | 9.140447860406259E-11
ATS | 41.89465025163305
AWK | 11.290069486393975
ActionScript | 31.3568633027012
Ada | 61.269572412982384
Agda | 2.4337660860304755
Alchemist | 2.232399710231226E-103
Alex | 0
Alloy | 0.000002207234501959681
Android Interface Definition Language | 26.984662160277367
Arvo | 0
AsciiDoc | 0
Assembly | 2.263919769706678E-72
AutoHotKey | 15.833985920534857
Autoconf | 0.47779749499136687
BASH | 34.15625059662068
Basic | 1.4219117348874069
Batch | 1.0430908205926455
Bazel | 71.21859817579139
Bitbake | 0.002480246487177871
Bitbucket Pipeline | 0.567799577547725
Boo | 5.03128187009327
Bosque | 0.125244140625
Brainfuck | 0
BuildStream | 12.84734197699206
C | 256.2839210573451
C Header | 184.88885430308878
C Shell | 5.8409870392823375
C# | 30.96563720101839
C++ | 44.61584829131642
C++ Header | 27.578790410119197
CMake | 1.7564333047949374
COBOL | 0.7503204345703562
CSS | 4.998773531463529
CSV | 0
Cabal | 4.899812531420634
Cargo Lock | 0.0703125
Cassius | 0.07177734654413487
Ceylon | 3.6406326349824667
Clojure | 0.0987220821845421
ClojureScript | 0.6025725119252456
Closure Template | 17.078124673988057
CoffeeScript | 1.6345682790069884
ColdFusion | 33.745563628665096
ColdFusion CFScript | 13.566947396771592
Coq | 20.3222774725393
Creole | 0
Crystal | 6.0308081267588145
Cython | 21.0593019957583
D | 0
Dart | 4.634361584097128
Device Tree | 33.64898256434121
Dhall | 1.0053101042303751
Docker ignore | 8.003553375601768E-11
Dockerfile | 4.526245545632278
Document Type Definition | 0
Elixir | 8.0581139370409
Elm | 24.73191350743249
Emacs Dev Env | 2.74822998046875
Emacs Lisp | 12.168370702306452
Erlang | 16.670030919109056
Expect | 3.606161126133445
Extensible Stylesheet Language Transformations | 0
F# | 0.5029605040200058
F* | 5.33528354690743E-27
FIDL | 343.0418392068642
FORTRAN Legacy | 8.121405267242158
FORTRAN Modern | 171.32042583820953
Fish | 7.979248739519377
Flow9 | 0.5049991616979244
Forth | 0.7578125
Fragment Shader File | 0.2373057885016209
Freemarker Template | 62.250244379050855
Futhark | 0.014113984877253714
GDScript | 31.14457228694065
GLSL | 0.2182627061047912
GN | 17.443267241931284
Game Maker Language | 3.9815753922640824
Game Maker Project | 0
Gherkin Specification | 0.0032959059321794604
Go | 6.464829990599041
Go Template | 4.460169822267483E-251
Gradle | 0.5374194774415457
Groovy | 32.32068506016523
HEX | 0
HTML | 0.16671794164614084
Hamlet | 4.203293477836184E-24
Handlebars | 0.9389737429747177
Happy | 0
Haskell | 20.323476462551376
Haxe | 9.023509566990532
IDL | 1.01534495399968
Idris | 0.36279318680267497
Intel HEX | 0
Isabelle | 4.389802167076498
JAI | 2.220446049250313E-16
JSON | 0
JSONL | 0
JSX | 0.9860839844113964
Jade | 0.25000000000034117
Janet | 9.719207406044006
Java | 330.66188089718935
JavaScript | 22.102491285372537
JavaServer Pages | 4.31250095370342
Jenkins Buildfile | 0
Jinja | 2.5412145720173454E-50
Julia | 12.542627036271085
Julius | 0.24612165248208867
Jupyter | 0
Just | 0.3186038732601446
Korn Shell | 40.89005232702741
Kotlin | 0.3259347784770708
LD Script | 3.7613336386434204
LESS | 15.495439701029127
LEX | 55.277186392539086
LOLCODE | 13.578125958700468
LaTeX | 3.316717967334341
Lean | 21.194565176965895
License | 0
Lisp | 88.10676444837796
Lua | 76.67247973843406
Lucius | 0.3894241626790286
Luna | 16.844066019174637
MQL Header | 82.22436339969337
MQL4 | 1.957314499740677
MQL5 | 27.463183855085845
MSBuild | 0.19561428198176206
MUMPS | 5.960464477541773E-8
Macromedia eXtensible Markup Language | 0
Madlang | 6.75
Makefile | 1.2287070602578574
Mako | 1.3997604187154047E-8
Markdown | 0
Meson | 4.594536366188615
Modula3 | 3.4375390004645627
Module-Definition | 7.754887182446689
Monkey C | 0.02734480644075532
Mustache | 0.0000038370490074157715
Nim | 0.8432132130061808
Nix | 165.09375
OCaml | 27.238212826702338
Objective C | 32.250000004480256
Objective C++ | 4.688333711547599
Opalang | 3.2498599900436704
Org | 2.4032862186444435
Oz | 11.531631554476924
PHP | 0.37573912739754056
PKGBUILD | 0
PSL Assertion | 4.470348358154297E-7
Pascal | 274.7797153576955
Patch | 0
Perl | 42.73014043490598
Plain Text | 0
Polly | 0
Pony | 0.2718505859375
Powershell | 2.0956492198317282
Processing | 11.358358417519032
Prolog | 6.93889390390723E-17
Properties File | 4.297774864451927
Protocol Buffers | 5.013992889700926
Puppet | 1.9962931947466012
PureScript | 6.608705271035433
Python | 15.208443286809963
Q# | 0.4281108849922295
QCL | 13.880147817629737
QML | 16.17036877582475
R | 5.355639399818855
Rakefile | 0.4253943361101697
Razor | 0.2500305203720927
ReStructuredText | 0
Report Definition Language | 1.8589575837924928E-119
Robot Framework | 0
Ruby | 8.696056880656087
Ruby HTML | 0.031281024218515086
Rust | 22.359375028118006
SAS | 0.7712382248290134
SKILL | 0.002197265625
SPDX | 0
SQL | 0.4963180149979617
SRecode Template | 17.64534428715706
SVG | 0.780306812508952
Sass | 1.6041624981030795
Scala | 2.7290137764062656
Scheme | 18.68675828842983
Scons | 9.985132321266597
Shell | 19.757167057040007
Smarty Template | 0.0009841919236350805
Softbridge Basic | 4.76177694441164E-25
Specman e | 0.1925095270881778
Spice Netlist | 5.29710110812646
Standard ML (SML) | 0.20708566564292288
Stata | 0.04904100534194722
Stylus | 4.534405773074049
Swift | 1.8627019961192913E-9
Swig | 11.786422730001505
SystemVerilog | 0.00009708851624323821
Systemd | 0
TCL | 382.839838598133
TOML | 0.37500173695180483
TaskPaper | 0
TeX | 8.266233975096164
Thrift | 50.53134153016524
Twig Template | 0
TypeScript | 8.250029131770134
TypeScript Typings | 37.89904005334354
Unreal Script | 46.13322029508541
Ur/Web | 0.04756343913582129
Ur/Web Project | 6.776263578034403E-21
V | 28.75797889154211
VHDL | 37.47892257625405
Vala | 74.26528331441615
Varnish Configuration | 19.45791923156868
Verilog | 4.165537942430622
Verilog Args File | 0
Vertex Shader File | 1.7979557178975683
Vim Script | 0
Visual Basic | 0.26300267116040704
Visual Basic for Applications | 0.3985138943535276
Vue | 5.039982162930666E-52
Wolfram | 70.01674025323683
Wren | 30694.003311276458
XAML | 0.5000169009533838
XCode Config | 13.653495818959595
XML | 3.533205032457776
XML Schema | 0
Xtend | 19.279739396268607
YAML | 1.1074293861154887
Zig | 0.507775428428431
Zsh | 6.769231127673729
gitignore | 1.3347179947709417E-20
ignore | 0.0356445312500015
m4 | 5.4183238737327075
nuspec | 3.640625
sed | 6.423678000929861

What are the most common filenames?

What filenames are most common across all code-bases ignoring extension and case?

Had you asked me before I started this I would have said, README, main, index, license. Thankfully the results reflect my thoughts pretty well. Although there are a lot of interesting ones in there. I have no idea why so many projects contain a file called 15 or s15.

The makefile being the most common surprised me a little, but then I remembered it is used in many new JavaScript projects. Another interesting thing to note is that it appears jQuery is still king and reports of its death are greatly exaggerated, with it appearing as #4 on the list.

file-name | count
makefile | 59,141,098
index | 33,962,093
readme | 22,964,539
jquery | 20,015,171
main | 12,308,009
package | 10,975,828
license | 10,441,647
__init__ | 10,193,245
strings | 8,414,494
android | 7,915,225
config | 7,391,812
default | 5,563,255
build | 5,510,598
setup | 5,291,751
test | 5,282,106
irq | 4,914,052
15 | 4,295,032
country | 4,274,451
pom | 4,054,543
io | 3,642,747
system | 3,629,821
common | 3,629,698
gpio | 3,622,587
core | 3,571,098
module | 3,549,789
init | 3,378,919
dma | 3,301,536
bootstrap | 3,162,859
application | 3,000,210
time | 2,928,715
cmakelists | 2,907,539
plugin | 2,881,206
base | 2,805,340
s15 | 2,733,747
androidmanifest | 2,727,041
cache | 2,695,345
debug | 2,687,902
file | 2,629,406
app | 2,588,208
version | 2,580,288
assemblyinfo | 2,485,708
exception | 2,471,403
project | 2,432,361
util | 2,412,138
user | 2,343,408
clock | 2,283,091
timex | 2,280,225
pci | 2,231,228
style | 2,226,920
styles | 2,212,127

Note that due to memory constraints I made this process slightly lossy. Every 100 projects I would check the map, and any filename with a count below 10 was dropped from the list. A dropped name could come back in a later batch, and if its count was above 10 at the next check it would remain. It shouldn’t happen that often, but the counts may be out by some amount if a common name appeared sparsely in the first batches of repositories before becoming common. In short, they are not absolute numbers but should be close enough.
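The lossy counting described above could be sketched roughly as follows. This is not the code I actually ran, just a minimal illustration. The function names (`normalizeName`, `pruneRare`, `countFilenames`) are made up for this example, and stripping only the final extension is an assumption about what “ignoring extension” means.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// normalizeName lowercases a filename and strips its final extension,
// so "README.md" and "readme.txt" both count as "readme".
// (Assumption: only the last extension is removed.)
func normalizeName(name string) string {
	base := filepath.Base(name)
	base = strings.TrimSuffix(base, filepath.Ext(base))
	return strings.ToLower(base)
}

// pruneRare deletes every entry whose count is below min.
// Deleting from a map while ranging over it is safe in Go.
func pruneRare(counts map[string]int, min int) {
	for name, c := range counts {
		if c < min {
			delete(counts, name)
		}
	}
}

// countFilenames tallies normalized names project by project and prunes
// the map after every pruneEvery projects to keep memory bounded.
func countFilenames(projects [][]string, pruneEvery, min int) map[string]int {
	counts := make(map[string]int)
	for i, files := range projects {
		for _, f := range files {
			counts[normalizeName(f)]++
		}
		if (i+1)%pruneEvery == 0 {
			pruneRare(counts, min)
		}
	}
	return counts
}

func main() {
	// Tiny hypothetical input; the real run used pruneEvery=100, min=10.
	projects := [][]string{
		{"README.md", "Makefile", "main.go"},
		{"readme.txt", "makefile", "rare.c"},
	}
	counts := countFilenames(projects, 2, 2)
	fmt.Println(counts["readme"], counts["makefile"], counts["rare"])
}
```

The trade-off is visible in the example: a name that never builds up enough counts within a batch is forgotten each time, so its final total undercounts.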

I could have used a trie structure to “compress” the space and gotten absolute numbers for this, but I didn’t feel like writing one and just abused the map slightly to save enough memory and achieve my goal. I am however curious enough to try this out at a later date to see how a trie would perform.

How many repositories appear to be missing a license?

This is an interesting one. Which repositories have an explicit license file somewhere? Note that the lack of a license file here does not mean that the project has none, as it might exist within the README or be indicated through SPDX comment tags in-line. It just means that scc could not find an explicit license file using its own criteria, which at time of writing means a file named (ignoring case) “license”, “licence”, “copying”, “copying3”, “unlicense”, “unlicence”, “license-mit”, “licence-mit” or “copyright”.
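As a rough illustration of that check (not scc’s actual implementation), a case-insensitive lookup against the list above might look like this. The names `licenseNames`, `isLicenseFile` and `hasLicense` are hypothetical, and matching the whole filename with no extension handling is an assumption on my part.

```go
package main

import (
	"fmt"
	"strings"
)

// licenseNames holds the filenames listed in the text, lowercased.
var licenseNames = map[string]bool{
	"license": true, "licence": true, "copying": true, "copying3": true,
	"unlicense": true, "unlicence": true, "license-mit": true,
	"licence-mit": true, "copyright": true,
}

// isLicenseFile reports whether a bare filename matches the list,
// ignoring case. (Assumption: the whole name must match.)
func isLicenseFile(name string) bool {
	return licenseNames[strings.ToLower(name)]
}

// hasLicense reports whether any file in a project looks like a license.
func hasLicense(files []string) bool {
	for _, f := range files {
		if isLicenseFile(f) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasLicense([]string{"main.go", "LICENSE"}))   // true
	fmt.Println(hasLicense([]string{"main.go", "README.md"})) // false
}
```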

Sadly it appears that the vast majority of repositories are missing a license. I would argue that all software should have a license for a variety of reasons but here is someone else’s take on that.

has license | count
no | 6,502,753
yes | 2,597,330

scc-data license count

How many projects use multiple .gitignore files?

Some may not know this, but it is possible to have multiple .gitignore files in a git project. Given that fact, how many projects use multiple .gitignore files? And while we are looking, how many have none?

What I did find that was interesting was one project that has 25,794 .gitignore files in its repository. The next highest was 2,547. I have no idea what is going on there. I had a brief look at it and it looks like they are used to allow checking in of the directories but I cannot confirm this.

Bringing this back to something sensible, here is a plot of the data up to 20 .gitignore files, which covers close to 99% of the total result.

scc-data process load

As you would expect, the majority of projects have either 0 or 1 .gitignore files. The results confirm this, with a massive drop-off of roughly 10x for projects with 2 .gitignore files. What surprised me was how many projects have more than a single .gitignore file. The long tail is especially long in this case.
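For the curious, the tally behind these numbers can be sketched along the following lines. This is a simplified illustration rather than the code I ran. The helper names are made up, and it assumes each project is available as a list of slash-separated file paths.

```go
package main

import (
	"fmt"
	"path"
)

// countGitignores returns how many of the given slash-separated paths
// point at a file named exactly ".gitignore", at any depth.
func countGitignores(paths []string) int {
	n := 0
	for _, p := range paths {
		if path.Base(p) == ".gitignore" {
			n++
		}
	}
	return n
}

// histogram maps "number of .gitignore files" to "number of projects".
func histogram(projects [][]string) map[int]int {
	h := make(map[int]int)
	for _, files := range projects {
		h[countGitignores(files)]++
	}
	return h
}

func main() {
	// Three hypothetical projects with 0, 1 and 3 .gitignore files.
	projects := [][]string{
		{"main.go"},
		{".gitignore", "main.go"},
		{".gitignore", "vendor/.gitignore", "docs/.gitignore"},
	}
	h := histogram(projects)
	fmt.Println(h[0], h[1], h[3])
}
```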

I was also curious as to why some projects had thousands of .gitignore files. One of the main offenders appears to be forks of https://github.com/PhantomX/slackbuilds which all have ~2,547 .gitignore files. However the other repositories with 1000+ ignore files are listed below.

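.gitignore count | project count
0 | 3,628,829
1 | 4,576,435
2 | 387,748
3 | 136,641
4 | 79,808
5 | 48,336
6 | 33,686
7 | 33,408
8 | 22,571
9 | 16,453
10 | 11,198
11 | 10,070
12 | 8,194
13 | 7,701
14 | 5,040
15 | 4,320
16 | 5,905
17 | 4,156
18 | 4,542
19 | 3,828
20 | 2,706
21 | 2,449
22 | 1,975
23 | 2,255
24 | 2,060
25 | 1,768
26 | 2,886
27 | 2,648
28 | 2,690
29 | 1,949
30 | 1,677
31 | 3,348
32 | 1,176
33 | 794
34 | 1,153
35 | 845
36 | 488
37 | 627
38 | 533
39 | 502
40 | 398
41 | 370
42 | 355
43 | 1,002
44 | 265
45 | 262
46 | 295
47 | 178
48 | 384
49 | 270
50 | 189
51 | 435
52 | 202
53 | 196
54 | 325
55 | 253
56 | 320
57 | 126
58 | 329
59 | 286
60 | 292
61 | 152
62 | 237
63 | 163
64 | 149
65 | 187
66 | 164
67 | 92
68 | 80
69 | 138
70 | 102
71 | 68
72 | 62
73 | 178
74 | 294
75 | 89
76 | 118
77 | 110
78 | 319
79 | 843
80 | 290
81 | 162
82 | 127
83 | 147
84 | 170
85 | 275
86 | 1,290
87 | 614
88 | 4,014
89 | 2,275
90 | 775
91 | 3,630
92 | 362
93 | 147
94 | 110
95 | 71
96 | 75
97 | 62
98 | 228
99 | 71
100 | 174
101 | 545
102 | 304
103 | 212
104 | 284
105 | 516
106 | 236
107 | 39
108 | 69
109 | 131
110 | 82
111 | 102
112 | 465
113 | 621
114 | 47
115 | 59
116 | 43
117 | 40
118 | 43
119 | 443
120 | 72
121 | 42
122 | 33
123 | 392
124 | 66
125 | 46
126 | 381
127 | 19
128 | 99
129 | 906
130 | 52
131 | 19
132 | 11
133 | 99
134 | 10
135 | 15
136 | 6
137 | 22
138 | 44
139 | 33
140 | 24
141 | 33
142 | 39
143 | 48
144 | 80
145 | 20
146 | 28
147 | 19
148 | 17
149 | 11
150 | 20
151 | 57
152 | 35
153 | 24
154 | 31
155 | 35
156 | 55
157 | 89
158 | 57
159 | 88
160 | 18
161 | 47
162 | 56
163 | 36
164 | 63
165 | 99
166 | 44
167 | 64
168 | 86
169 | 70
170 | 111
171 | 106
172 | 25
173 | 39
174 | 14
175 | 25
176 | 53
177 | 20
178 | 56
179 | 11
180 | 7
181 | 40
182 | 32
183 | 17
184 | 68
185 | 38
186 | 16
187 | 3
188 | 4
189 | 2
190 | 12
191 | 18
192 | 37
193 | 9
194 | 10
195 | 11
196 | 18
197 | 45
198 | 27
199 | 11
200 | 39
201 | 23
202 | 37
203 | 22
204 | 21
205 | 7
206 | 40
207 | 7
208 | 8
209 | 16
210 | 29
211 | 20
212 | 21
213 | 7
214 | 4
215 | 12
217 | 21
218 | 13
220 | 12
221 | 2
222 | 15
223 | 4
224 | 12
225 | 9
226 | 1
227 | 8
228 | 3
229 | 6
230 | 8
231 | 31
232 | 26
233 | 6
234 | 17
235 | 6
236 | 23
237 | 1
238 | 11
239 | 2
240 | 10
241 | 7
242 | 11
243 | 1
244 | 14
245 | 21
246 | 3
247 | 12
248 | 1
249 | 6
250 | 10
251 | 5
252 | 18
253 | 7
254 | 17
255 | 4
256 | 16
257 | 8
258 | 24
259 | 17
260 | 4
261 | 1
262 | 3
263 | 12
264 | 3
265 | 8
267 | 2
268 | 1
269 | 3
271 | 4
272 | 1
273 | 1
274 | 1
275 | 3
276 | 6
279 | 5
280 | 1
281 | 1
284 | 4
285 | 1
286 | 1
288 | 2
289 | 1
290 | 5
291 | 4
293 | 7
294 | 4
295 | 1
296 | 1
297 | 1
299 | 70
300 | 2
301 | 4
302 | 1
303 | 7
305 | 1
306 | 2
307 | 2
309 | 1
310 | 7
311 | 1
313 | 14
316 | 1
320 | 1
321 | 6
322 | 2
323 | 3
324 | 4
327 | 4
328 | 2
329 | 1
330 | 13
331 | 5
332 | 11
333 | 3
334 | 1
335 | 1
336 | 11
337 | 1
338 | 20
339 | 11
340 | 2
341 | 6
342 | 10
343 | 37
344 | 25
345 | 9
346 | 32
347 | 4
348 | 9
349 | 7
350 | 12
351 | 2
352 | 5
354 | 7
358 | 32
359 | 7
360 | 6
361 | 1
362 | 21
363 | 14
364 | 51
365 | 17
367 | 18
368 | 9
370 | 7
371 | 6
372 | 15
373 | 1
374 | 38
375 | 113
376 | 57
377 | 37
378 | 23
379 | 87
380 | 65
382 | 1
386 | 2
388 | 1
391 | 5
392 | 1
394 | 1
397 | 3
401 | 1
403 | 1
408 | 1
409 | 2
410 | 5
411 | 1
413 | 4
415 | 1
418 | 1
420 | 1
427 | 3
428 | 2
430 | 2
433 | 314
437 | 1
450 | 2
453 | 1
468 | 1
469 | 1
483 | 5
484 | 1
486 | 1
488 | 2
489 | 9
490 | 4
492 | 2
493 | 106
494 | 3
495 | 1
496 | 2
498 | 1
512 | 1
539 | 1
553 | 1
560 | 2
570 | 2
600 | 1
602 | 3
643 | 1
646 | 2
657 | 1
663 | 1
670 | 1
672 | 2
729 | 5
732 | 1
739 | 1
744 | 1
759 | 1
778 | 1
819 | 1
859 | 1
956 | 1
959 | 2
964 | 2
965 | 1
973 | 1
1,133 | 1
1,186 | 1
1,267 | 2
1,523 | 1
2,535 | 1
2,536 | 1
2,537 | 2
2,539 | 1
2,540 | 1
2,541 | 5
2,542 | 1
2,545 | 1
2,547 | 1
25,794 | 1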

Which language developers have the biggest potty mouth?

Working this out is not an exact science. It falls into the NLP class of problems really. Picking up cursing/swearing or offensive terms using filenames from a defined list is never going to be effective. If you do a simple string contains test you pick up all sorts of normal files such as assemble.sh and such. So to produce the following I pulled a list of curse words, then checked if any files in each project start with one of those values followed by a period. This means a file named gangbang.java would be picked up while assemble.sh would not. However this is going to miss all sorts of cases such as pu55syg4rgle.java and other such crude names.

The list I used contained some leet speak such as b00bs and b1tch to try and catch some of the most interesting cases. The full list is here.
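The “word followed by a period” test is simple enough to sketch. The following is an illustration only, not the exact code I used. The function name `matchCurse` is made up, the word list is a tiny sample, and lowercasing the filename first is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// matchCurse returns the first word from words that the filename starts
// with, immediately followed by a period. A plain "contains" test would
// wrongly flag names like assemble.sh; the trailing period avoids that.
// (Assumption: the comparison is case-insensitive.)
func matchCurse(filename string, words []string) (string, bool) {
	lower := strings.ToLower(filename)
	for _, w := range words {
		if strings.HasPrefix(lower, w+".") {
			return w, true
		}
	}
	return "", false
}

func main() {
	words := []string{"ass", "crap", "damn"} // small sample of the list
	for _, name := range []string{"assemble.sh", "crap.java", "Damn.go"} {
		w, ok := matchCurse(name, words)
		fmt.Println(name, ok, w)
	}
}
```

Here assemble.sh is not flagged because “ass” is not followed by a period, while crap.java and Damn.go are.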

While not accurate at all, as mentioned, it is incredibly fun to see what this produces. So let's start with a list of which languages have the most curse words. However we should probably weight this against how much code exists as well. So here are the top ones.

language | filename curse count | percent of files
C Header | 7,660 | 0.00126394567906%
Java | 7,023 | 0.00258792635479%
C | 6,897 | 0.00120706524533%
PHP | 5,713 | 0.00283428484703%
JavaScript | 4,306 | 0.00140692338568%
HTML | 3,560 | 0.00177646776919%
Ruby | 3,121 | 0.00223136542655%
JSON | 1,598 | 0.00293688627715%
C++ | 1,543 | 0.00135977378652%
Dart | 1,533 | 0.19129310646%
Rust | 1,504 | 0.038465935524%
Go Template | 1,500 | 0.0792233157387%
SVG | 1,234 | 0.00771043360379%
XML | 1,212 | 0.000875741051608%
Python | 1,092 | 0.00119138129893%
JavaServer Pages | 1,037 | 0.0215440542669%

Interesting! My first thought was “those naughty C developers!” but as it turns out while they have a high count they write so much code it probably isn’t that big a deal. However pretty clearly Dart developers have an axe to grind! If you know someone coding in Dart you may want to go offer them a hug.

I also want to know what the most commonly used curse words are. Let's see how dirty a mind we have collectively. A few of the top ones I could see being legitimate names (if you squint), but the majority would certainly produce a few comments in a PR and a raised eyebrow.

word | count
ass | 11,358
knob | 10,368
balls | 8,001
xxx | 7,205
sex | 5,021
nob | 3,385
pawn | 2,919
hell | 2,819
crap | 1,112
anal | 950
snatch | 885
fuck | 572
poop | 510
cox | 476
shit | 383
lust | 367
butt | 265
bum | 151
bugger | 132
pron | 121
cum | 118
cok | 112
damn | 105

Note that some of the more offensive words in the list did have matching filenames which I find rather shocking considering what they were. Thankfully they were not very common and didn’t make my list above which was limited to those which had counts over 100. I am hoping that those files only exist for testing allow/deny lists and such.

Longest files by lines per language

As you would probably expect Plain Text, SQL, XML, JSON and CSV take the top positions of this one, seeing as they usually contain meta-data, database dumps and the like.

Limited to 40 because at some point there is only a hello world example or such available and the result is not very interesting. It is not surprising to see that someone has checked in sqlite3.c somewhere but I would be a little worried about that 3,064,594 line Python file and that 1,997,637 line TypeScript monster.

NB Some of the links below MAY not translate 100% due to throwing away some information when I created the files. Most should work, but a few you may need to mangle the URL to resolve.

language | filename | lines
Plain Text | 1366100696temp.txt | 347,671,811
PHP | phpfox_error_log_04_04_12_3d4b11f6ee2a89fd5ace87c910cee04b.php | 121,930,973
HTML | yo.html | 54,596,752
LEX | l | 39,743,785
XML | dblp.xml | 39,445,222
Autoconf | 21-t2.in | 33,526,784
CSV | ontology.csv | 31,946,031
Prolog | top500_full.p | 22,428,770
JavaScript | mirus.js | 22,023,354
JSON | douglasCountyVoterRegistration.json | 21,104,668
Game Maker Language | lg.gml | 13,302,632
C Header | trk6data.h | 13,025,371
Objective C++ | review-1.mm | 12,788,052
SQL | newdump.sql | 11,595,909
Patch | clook_iosched-team01.patch | 10,982,879
YAML | data.yml | 10,764,489
SVG | large-file.svg | 10,485,763
Sass | large_empty.scss | 10,000,000
Assembly | J.s | 8,388,608
LaTeX | tex | 8,316,556
C++ Header | primpoly_impl.hh | 8,129,599
Lisp | simN.lsp | 7,233,972
Perl | aimlCore3.pl | 6,539,759
SAS | output.sas | 5,874,153
C | CathDomainDescriptionFile.v3.5.c | 5,440,052
Lua | giant.lua | 5,055,019
R | disambisearches.R | 4,985,492
MUMPS | ref.mps | 4,709,289
HEX | combine.hex | 4,194,304
Python | mappings.py | 3,064,594
Scheme | atomspace.scm | 3,027,366
C++ | Int.cpp | 2,900,609
Properties File | nuomi_active_user_ids.properties | 2,747,671
Alex | Dalek.X | 2,459,209
TCL | TCL | 2,362,970
Ruby | smj_12_2004.rb | 2,329,560
Wolfram | hmm.nb | 2,177,422
Brainfuck | BF | 2,097,158
TypeScript | all_6.ts | 1,997,637
Module-Definition | matrix.def | 1,948,817
LESS | less | 1,930,356
Objective C | faster.m | 1,913,966
Org | default.org | 1,875,096
Jupyter | ReHDDM - AllGo sxFits-Copy0.ipynb | 1,780,197
Specman e | twitter.e | 1,768,135
F* | Pan_troglodytes_monomers.fst | 1,739,878
Systemd | video_clean_lower_tokenized.target | 1,685,570
V | ImageMazeChannelValueROM.v | 1,440,068
Markdown | eukaryota.md | 1,432,161
TeX | japanischtest.tex | 1,337,456
Forth | europarl.tok.fr | 1,288,074
Shell | add_commitids_to_src.sh | 1,274,873
SKILL | hijacked.il | 1,187,701
CSS | 7f116c3.css | 1,170,216
C# | Form1.cs | 1,140,480
gitignore | .gitignore | 1,055,167
Boo | 3.out.tex | 1,032,145
Java | Monster.java | 1,000,019
ActionScript | as | 1,000,000
MSBuild | train.props | 989,860
D | D | 883,308
Coq | CompiledDFAs.v | 873,354
Clojure | raw-data.clj | 694,202
Swig | 3DEditor.i | 645,117
Happy | y | 624,673
GLSL | capsid.vert | 593,618
Verilog | pipeline.vg | 578,418
Standard ML (SML) | Ambit3-HRVbutNoHR.sml | 576,071
SystemVerilog | bitcoinminer.v | 561,974
Visual Basic | linqStoreProcs.designer.vb | 561,067
Go | info.go | 559,236
Expect | Argonne_hourly_dewpoint.exp | 552,269
Erlang | sdh_analogue_data.erl | 473,924
Makefile | Makefile | 462,433
QML | 2005.qml | 459,113
SPDX | linux-coreos.spdx | 444,743
VHDL | cpuTest.vhd | 442,043
ASP.NET | AllProducts.aspx | 438,423
XML Schema | AdvanceShipNotices.xsd | 436,055
Elixir | gene.train.with.rare.ex | 399,995
Macromedia eXtensible Markup Language | StaticFlex4PerformanceTest20000.mxml | 399,821
Ada | bmm_top.adb | 390,275
TypeScript Typings | dojox.d.ts | 384,171
Pascal | FHIR.R4.Resources.pas | 363,291
COBOL | cpy | 358,745
Basic | excel-vba-streams-#1.bas | 333,707
Visual Basic for Applications | Dispatcher.cls | 332,266
Puppet | main_110.pp | 314,217
FORTRAN Legacy | f | 313,599
OCaml | Pent.ML | 312,749
FORTRAN Modern | slatec.f90 | 298,677
CoffeeScript | dictionary.coffee | 271,378
Nix | hackage-packages.nix | 259,940
Intel HEX | epdc_ED060SCE.fw.ihex | 253,836
Scala | models_camaro.sc | 253,559
Julia | IJulia 0.jl | 221,058
SRecode Template | espell.srt | 216,243
sed | CSP-2004fe.SED | 214,290
ReStructuredText | S40HO033.rst | 211,403
Bosque | world_dem_5arcmin_geo.bsq | 199,238
Emacs Lisp | ubermacros.el | 195,861
F# | Ag_O1X5.5_O2X0.55.eam.fs | 180,008
GDScript | 72906.gd | 178,628
Gherkin Specification | feature | 175,229
Haskell | Excel.hs | 173,039
Dart | surnames_list.dart | 153,144
Bazel | matplotlib_1.3.1-1_amd64-20140427-1441.build | 149,234
Haxe | elf-x86id.hx | 145,800
IDL | all-idls.idl | 129,435
LD Script | kernel_partitions.lds | 127,187
Monkey C | LFO_BT1-point.mc | 120,881
Modula3 | tpch22.m3 | 120,185
Batch | EZhunter.cmd | 119,341
Rust | data.rs | 114,408
Ur/Web | dict.ur-en.ur | 113,911
Unreal Script | orfs.derep_id97.uc | 110,737
Groovy | groovy | 100,297
Smarty Template | assign.100000.tpl | 100,002
Bitbake | bb | 100,000
BASH | palmer-master-thesis.bash | 96,911
PSL Assertion | test_uno.psl | 96,253
ASP | sat_gbie_01.asp | 95,144
Protocol Buffers | select1.proto | 89,796
Report Definition Language | ACG.rdl | 84,666
Powershell | PresentationFramework.ps1 | 83,861
Jinja | jinja2 | 76,040
AWK | words-large.awk | 69,964
LOLCODE | lol | 67,520
Wren | reuse_constants.wren | 65,550
JSX | AEscript.jsx | 65,108
Rakefile | seed.rake | 63,000
Stata | .31113.do | 60,343
Vim Script | ddk.vim | 60,282
Swift | Google.Protobuf.UnittestEnormousDescriptor.proto.swift | 60,236
Korn Shell | attachment-0002.ksh | 58,298
AsciiDoc | index.adoc | 52,627
Freemarker Template | designed.eml.ftl | 52,160
Cython | CALC.pex.netlist.CALC.pxi | 50,283
m4 | ax.m4 | 47,828
Extensible Stylesheet Language Transformations | green_ccd.xslt | 37,247
License | copyright | 37,205
JavaServer Pages | 1MB.jsp | 36,007
Document Type Definition | bookmap.dtd | 32,815
Fish | Godsay.fish | 31,112
ClojureScript | core.cljs | 31,013
Robot Framework | robot | 30,460
Processing | data.pde | 30,390
Ruby HTML | big_table.rhtml | 29,306
ColdFusion | spreadsheet2009Q1.cfm | 27,974
CMake | ListOfVistARoutines.cmake | 27,550
ATS | test06.dats | 24,350
Nim | windows.nim | 23,949
Vue | Ogre.vue | 22,916
Razor | validationerror.cshtml | 22,832
Spice Netlist | input6.ckt | 22,454
Isabelle | WooLam_cert_auto.thy | 22,312
XAML | SymbolDrawings.xaml | 20,764
Opalang | p4000_g+5.0_m0.0_t00_st_z+0.00_a+0.00_c+0.00_n+0.00_o+0.00_r+0.00_s+0.00.opa | 20,168
TOML | too_large.toml | 20,000
Madlang | evgg.mad | 19,416
Stylus | test.styl | 19,127
Go Template | html-template.tmpl | 19,016
AutoHotKey | glext.ahk | 18,036
ColdFusion CFScript | IntakeHCPCIO.cfc | 17,606
Zsh | _oc.zsh | 17,307
Twig Template | show.html.twig | 16,320
ABAP | ZRIM01F01.abap | 16,029
Elm | 57chevy.elm | 14,968
Kotlin | _Arrays.kt | 14,396
Varnish Configuration | 40_generic_attacks.vcl | 13,367
Mustache | huge.mustache | 13,313
Alloy | output.als | 12,168
Device Tree | tegra132-flounder-emc.dtsi | 11,893
MQL4 | PhD Appsolute System.mq4 | 11,280
Jade | fugue.jade | 10,711
Q# | in_navegador.qs | 10,025
JSONL | train.jsonl | 10,000
Flow9 | graph2.flow | 9,902
Vala | mwp.vala | 8,765
Handlebars | theme.scss.hbs | 8,259
Crystal | CR | 8,084
C Shell | plna.csh | 8,000
Hamlet | hamlet | 7,882
BuildStream | biometrics.bst | 7,746
Mako | verificaciones.mako | 7,306
Agda | Pifextra.agda | 6,483
Thrift | concourse.thrift | 6,471
Fragment Shader File | ms812_bseqoslabel_l.fsh | 6,269
Cargo Lock | Cargo.lock | 6,202
Xtend | UMLSlicerAspect.xtend | 5,936
Arvo | test-extra-large.avsc | 5,378
Scons | SConstruct | 5,272
Closure Template | buckconfig.soy | 5,189
GN | BUILD.gn | 4,653
Softbridge Basic | owptext.sbl | 4,646
PKGBUILD | PKGBUILD | 4,636
Oz | StaticAnalysis.oz | 4,500
Lucius | bootstrap.lucius | 3,992
Ceylon | RedHatTransformer.ceylon | 3,907
Creole | MariaDB_Manager_Monitors.creole | 3,855
Luna | Base.luna | 3,731
Gradle | dependencies.gradle | 3,612
MQL Header | IncGUI.mqh | 3,544
Cabal | smartword.cabal | 3,452
Emacs Dev Env | ede | 3,400
Meson | meson.build | 3,264
nuspec | Npm.js.nuspec | 2,823
Game Maker Project | LudumDare.yyp | 2,679
Julius | default-layout.julius | 2,454
Idris | ring_reduce.idr | 2,434
Alchemist | out.lmf-dos.crn | 2,388
MQL5 | DTS1-Build_814.1_B-test~.mq5 | 2,210
Android Interface Definition Language | ITelephony.aidl | 2,005
Vertex Shader File | sdk_macros.vsh | 1,922
Lean | interactive.lean | 1,664
Jenkins Buildfile | Jenkinsfile | 1,559
FIDL | amb.in.fidl | 1,502
Pony | scenery.pony | 1,497
PureScript | prelude.purs | 1,225
TaskPaper | task-3275.taskpaper | 1,196
Dockerfile | Dockerfile | 1,187
Janet | Janet | 1,158
Futhark | math.fut | 990
Zig | main.zig | 903
XCode Config | Project-Shared.xcconfig | 522
JAI | LCregistryFile.jai | 489
QCL | bwt.qcl | 447
Ur/Web Project | reader.urp | 346
Cassius | default-layout.cassius | 313
Docker ignore | .dockerignore | 311
Dhall | largeExpressionA.dhall | 254
ignore | .ignore | 192
Bitbucket Pipeline | bitbucket-pipelines.yml | 181
Just | Justfile | 95
Verilog Args File | or1200.irunargs | 60
Polly | polly | 26

What's the most complex file in each language?

Once again these values are not directly comparable to each other, but it is interesting to see what is considered the most complex in each language.

Some of these files are absolute monsters. For example consider the most complex C++ file I found COLLADASaxFWLColladaParserAutoGen15PrivateValidation.cpp which is 28.3 MB of compiler hell (and thankfully appears to be generated).

NB Some of the links below MAY not translate 100% due to throwing away some information when I created the files. Most should work, but a few you may need to mangle the URL to resolve.

language | filename | complexity
C++ | COLLADASaxFWLColladaParserAutoGen15PrivateValidation.cpp | 682,001
JavaScript | blocks.js | 582,070
C Header | bigmofoheader.h | 465,589
C | fmFormula.c | 445,545
Objective C | faster.m | 409,792
SQL | dump20120515.sql | 181,146
ASP.NET | results-i386.master | 164,528
Java | ConcourseService.java | 139,020
TCL | 68030_TK.tcl | 136,578
C++ Header | TPG_hardcoded.hh | 129,465
TypeScript Typings | all.d.ts | 127,785
SVG | Class Diagram2.svg | 105,353
Lua | luaFile1000kLines.lua | 102,960
PHP | fopen.php | 100,000
Org | 2015-02-25_idfreeze-2.org | 63,326
Ruby | all_search_helpers.rb | 60,375
Scheme | test.ss | 50,000
Stata | .31113.do | 48,600
Elixir | pmid.sgd.crawl.ex | 46,479
Brainfuck | Poll.bf | 41,399
Perl | r1d7.pl | 41,128
Go | segment_words_prod.go | 34,715
Python | lrparsing-sqlite.py | 34,700
Module-Definition | wordnet3_0.def | 32,008
Clojure | raw-data.clj | 29,950
C# | Matrix.Product.Generated.cs | 29,675
D | parser.d | 27,249
FORTRAN Modern | euitm_routines_407c.f90 | 27,161
Puppet | sqlite3.c.pp | 25,753
SystemVerilog | 6s131.sv | 24,300
Autoconf | Makefile.in | 23,183
Specman e | hansards.e | 20,893
Smarty Template | test-include-09.tpl | 20,000
TypeScript | JSONiqParser.ts | 18,162
V | altera_mf.v | 13,584
F* | slayer-3.fst | 13,428
TeX | definitions.tex | 13,342
Swift | Google.Protobuf.UnittestEnormousDescriptor.proto.swift | 13,017
Assembly | all-opcodes.s | 12,800
Bazel | firebird2.5_2.5.2.26540.ds4-10_amd64-20140427-2159.build | 12,149
FORTRAN Legacy | lm67.F | 11,837
R | Rallfun-v36.R | 11,287
ActionScript | AccessorSpray.as | 10,804
Haskell | Tags.hs | 10,444
Prolog | books_save.p | 10,243
Dart | DartParser.dart | 9,606
VHDL | unisim_VITAL.vhd | 9,590
Batch | test.bat | 9,424
Boo | compman.tex | 9,280
Coq | NangateOpenCellLibrary.v | 8,988
Shell | i3_completion.sh | 8,669
Kotlin | 1.kt | 7,388
JSX | typescript-parser.jsx | 7,123
Makefile | Makefile | 6,642
Emacs Lisp | bible.el | 6,345
Objective C++ | set.mm | 6,285
OCaml | sparcrec.ml | 6,285
Expect | condloadstore.stdout.exp | 6,144
SAS | import_REDCap.sas | 5,783
Julia | pilot-2013-05-14.jl | 5,599
Cython | types.pyx | 5,278
Modula3 | tpch22.m3 | 5,182
Haxe | T1231.hx | 5,110
Visual Basic for Applications | Coverage.cls | 5,029
Lisp | simN.lsp | 4,994
Scala | SpeedTest1MB.sc | 4,908
Groovy | ZulTagLib.groovy | 4,714
Powershell | PresentationFramework.ps1 | 4,108
Ada | bhps-print_full_version.adb | 3,961
JavaServer Pages | sink_jq.jsp | 3,850
GN | patch-third_partyffmpegffmpeg_generated.gni | 3,742
Basic | MSA_version116_4q.bas | 3,502
Pascal | Python_StdCtrls.pas | 3,399
Standard ML (SML) | arm.sml | 3,375
Erlang | lipsum.hrl | 3,228
ASP | mylib.asp | 3,149
CSS | three-viewer.css | 3,071
Unreal Script | ScriptedPawn.uc | 2,909
CoffeeScript | game.coffee | 2,772
AutoHotKey | fishlog5.93.ahk | 2,764
MQL4 | PhD Appsolute System.mq4 | 2,738
Processing | Final.pde | 2,635
Isabelle | StdInst.thy | 2,401
Razor | Checklist.cshtml | 2,341
Sass | _multi-color-css-stackicons-social.scss | 2,325
Vala | valaccodebasemodule.vala | 2,100
MSBuild | all.props | 2,008
Rust | ffi.rs | 1,928
QML | Dots.qml | 1,875
F# | test.fsx | 1,826
Vim Script | netrw.vim | 1,790
Korn Shell | attachment.ksh | 1,773
Vue | vue | 1,738
sed | SED | 1,699
GLSL | comp | 1,699
Nix | auth.nix | 1,615
Mustache | template.mustache | 1,561
Bitbake | my-2010.bb | 1,549
Ur/Web | votes.ur | 1,515
BASH | pgxc_ctl.bash | 1,426
MQL Header | hanoverfunctions.mqh | 1,393
Visual Basic | LGMDdataDataSet.Designer.vb | 1,369
Q# | flfacturac.qs | 1,359
C Shell | regtest_hwrf.csh | 1,214
MQL5 | DTS1-Build_814.1_B-test~.mq5 | 1,186
Xtend | Parser.xtend | 1,116
Nim | disas.nim | 1,098
CMake | MacroOutOfSourceBuild.cmake | 1,069
Protocol Buffers | configure.proto | 997
SKILL | switch.il | 997
COBOL | geekcode.cob | 989
Game Maker Language | hydroEx_River.gml | 982
Gherkin Specification | upload_remixed_program_again_complex.feature | 959
Alloy | battleformulas.als | 948
Bosque | recover.bsq | 924
ColdFusion | jquery.js.cfm | 920
Stylus | buttron.styl | 866
ColdFusion CFScript | apiUtility.cfc | 855
Verilog | exec_matrix.vh | 793
Freemarker Template | DefaultScreenMacros.html.ftl | 771
Crystal | lexer.cr | 753
Forth | e4 | 690
Monkey C | mc | 672
Rakefile | import.rake | 652
Zsh | zshrc | 649
Ruby HTML | ext_report.rhtml | 633
Handlebars | templates.handlebars | 557
SRecode Template | Al3SEbeK61s.srt | 535
Scons | SConstruct | 522
Agda | Square.agda | 491
Ceylon | runtime.ceylon | 467
Julius | default-layout.julius | 436
Wolfram | qmSolidsPs8dContourPlot.nb | 417
Cabal | parconc-examples.cabal | 406
Fragment Shader File | flappybird.fsh | 349
ATS | ats_staexp2_util1.dats | 311
Jinja | php.ini.j2 | 307
Opalang | unicode.opa | 306
Twig Template | product_form.twig | 296
ClojureScript | core.cljs | 271
Hamlet | hamlet | 270
Oz | StaticAnalysis.oz | 267
Elm | Indexer.elm | 267
Meson | meson.build | 248
ABAP | ZRFFORI99.abap | 244
Dockerfile | Dockerfile | 243
Wren | repl.wren | 242
Fish | fisher.fish | 217
Emacs Dev Env | ede | 211
GDScript | tiled_map.gd | 195
IDL | bgfx.idl | 187
Jade | docs.jade | 181
PureScript | List.purs | 180
XAML | Midnight.xaml | 179
Flow9 | TypeMapper.js.flow | 173
Idris | Utils.idr | 166
PSL Assertion | pre_dec.psl | 162
Lean | kernel.lean | 161
MUMPS | link.mps | 161
Vertex Shader File | base.vsh | 152
Go Template | code-generator.tmpl | 148
Mako | pokemon.mako | 137
Closure Template | template.soy | 121
Zig | main.zig | 115
TOML | telex_o.toml | 100
Softbridge Basic | asm.sbl | 98
QCL | bwt.qcl | 96
Futhark | math.fut | 86
Pony | jstypes.pony | 70
LOLCODE | LOLTracer.lol | 61
Alchemist | alchemist.crn | 55
Madlang | Copying.MAD | 44
LD Script | plugin.lds | 39
Device Tree | dts | 22
FIDL | GlobalCapabilitiesDirectory.fidl | 19
JAI | LICENSE.jai | 18
Just | Justfile | 7
Android Interface Definition Language | aidl | 3
Ur/Web Project | jointSpace.urp | 2
Spice Netlist | GRI30.CKT | 2

What's the most complex file weighted against lines?

This sounds good in theory, but in reality anything minified or with no newlines skews the results, making this one effectively pointless. As such I have not included this calculation. I have however created an issue inside scc to support detection of minified code so it can be removed from the calculation results: https://github.com/boyter/scc/issues/91

It’s probably possible to infer this using just the data at hand, but I'd like to make it a more robust check that anyone using scc can benefit from.

What's the most commented file in each language?

I have no idea what sort of useful information you can get out of this, but it is interesting to have a look.

NB: Some of the links below may not resolve, because I threw away some information when I created the files. Most should work, but for a few you may need to mangle the URL to get it to resolve.

language | filename | comment lines
Prolog | ts-with-score-multiplier.p | 5,603,870
C | testgen.c | 1,705,508
Python | Untitled0.py | 1,663,466
JavaScript | 100MB.js | 1,165,656
SVG | p4-s3_I369600.svg | 1,107,955
SQL | test.sql | 858,993
C Header | head.h | 686,587
C++ | ResidueTopology.cc | 663,024
Autoconf | square_detector_local.in | 625,464
TypeScript | reallyLargeFile.ts | 583,708
LEX | polaris-xp900.l | 457,288
XML | Test1-CDL-soapui-project.xml | 411,321
HTML | todos_centros.html | 366,776
Pascal | FHIR.R4.Resources.pas | 363,289
SystemVerilog | mkToplevelBT64.v | 338,042
PHP | lt.php | 295,054
TypeScript Typings | dojox.d.ts | 291,002
Verilog | CVP14_synth.vg | 264,649
Lua | objects.lua | 205,006
V | TestDataset01-functional.v | 201,973
Java | FinalPackage.java | 198,035
C++ Header | test_cliprdr_channel_xfreerdp_full_authorisation.hpp | 196,958
Shell | add_commitids_to_src.sh | 179,223
C# | ItemId.cs | 171,944
FORTRAN Modern | slatec.f90 | 169,817
Assembly | HeavyWeather.asm | 169,645
Module-Definition | top_level.final.def | 139,150
FORTRAN Legacy | dlapack.f | 110,640
VHDL | cpuTest.vhd | 107,882
Groovy | groovy | 98,985
IDL | all-idls.idl | 91,771
Wolfram | K2KL.nb | 90,224
Go | frequencies.go | 89,661
Scheme | s7test.scm | 88,907
D | coral.jar.d | 80,674
Coq | cycloneiv_hssi_atoms.v | 74,936
Specman e | sysobjs.e | 65,146
Puppet | sqlite3.c.pp | 63,656
Wren | many_globals.wren | 61,388
Boo | sun95.tex | 57,018
Ruby | bigfile.rb | 50,000
Objective C | job_sub011.m | 44,788
CSS | screener.css | 43,785
Swig | CIDE.I | 37,235
Fish | Godsay.fish | 31,103
Sass | sm30_kernels.sass | 30,306
CoffeeScript | tmp.coffee | 29,088
Erlang | nci_ft_ricm_dul_SUITE.erl | 28,306
Lisp | km_2-5-33.lisp | 27,579
YAML | ciudades.yml | 27,168
R | PhyloSimSource.R | 26,023
Scala | GeneratedRedeclTests.scala | 24,647
Emacs Lisp | pjb-java.el | 24,375
Haskell | Dipole80.hs | 24,245
ATS | test06.dats | 24,179
m4 | ax.m4 | 22,675
ActionScript | __2E_str95.as | 21,173
Objective C++ | edges-new.mm | 20,789
Visual Basic | clsProjections.vb | 20,641
TCL | 68030_TK.tcl | 20,616
Nix | nix | 19,605
Perl | LF_aligner_3.12_with_modules.pl | 18,013
Ada | amf-internals-tables-uml_metamodel-objects.adb | 14,535
Batch | MAS_0.6_en.cmd | 14,402
OCaml | code_new.ml | 13,648
LaTeX | pm3dcolors.tex | 13,092
Properties File | messages_ar_SA.properties | 13,074
MSBuild | ncrypto.csproj | 11,302
ASP.NET | GallerySettings.ascx | 10,969
Powershell | mail_imap.ps1 | 10,798
Standard ML (SML) | TCP1_hostLTSScript.sml | 10,790
Dart | html_dart2js.dart | 10,547
AutoHotKey | studio.ahk | 10,391
Expect | Navigator.exp | 10,063
Julia | PETScRealSingle.jl | 9,417
Makefile | Makefile | 9,204
Forth | europarl.lowercased.fr | 9,107
ColdFusion | js.cfm | 8,786
TeX | hyperref.sty | 8,591
Opalang | i18n_language.opa | 7,860
LESS | _variables.less | 7,394
Swift | CodeSystems.swift | 6,847
Bazel | gcc-mingw-w64_12_amd64-20140427-2100.build | 6,429
Kotlin | _Arrays.kt | 5,887
SAS | 202_002_Stream_DQ_DRVT.sas | 5,597
Haxe | CachedRowSetImpl.hx | 5,438
Rust | lrgrammar.rs | 5,150
Monkey C | mc | 5,044
Cython | pcl_common_172.pxd | 5,030
Nim | disas.nim | 4,547
Game Maker Language | gm_spineapi.gml | 4,345
ABAP | ZACO19U_SHOP_NEW_1.abap | 4,244
XAML | Raumplan.xaml | 4,193
Razor | Privacy.cshtml | 4,092
Varnish Configuration | 46_slr_et_rfi_attacks.vcl | 3,924
Basic | MSA_version116_4q.bas | 3,892
Isabelle | Pick.thy | 3,690
Protocol Buffers | metrics_constants.proto | 3,682
BASH | bashrc | 3,606
Clojure | all-playlists-output.clj | 3,440
F# | GenericMatrixDoc.fs | 3,383
Thrift | NoteStore.thrift | 3,377
COBOL | db2ApiDf.cbl | 3,319
JavaServer Pages | sink_jq.jsp | 3,204
Modula3 | gdb.i3 | 3,124
Visual Basic for Applications | HL7xmlBuilder.cls | 2,987
Oz | timing.oz | 2,946
Closure Template | buckconfig.soy | 2,915
Agda | Pifextra.agda | 2,892
Stata | R2_2cleaningprocess.do | 2,660
ColdFusion CFScript | Intake.cfc | 2,578
Luna | Base.luna | 2,542
Unreal Script | UIRoot.uc | 2,449
CMake | cmake | 2,425
Org | lens-wsn.org | 2,417
Flow9 | index.js.flow | 2,361
MQL Header | IncGUI.mqh | 2,352
JSX | ContactSheetII.jsx | 2,243
MQL4 | PhD Appsolute System.mq4 | 2,061
Ruby HTML | FinalOral-Old.Rhtml | 2,061
GDScript | group.gd | 2,023
Processing | testcode.pde | 2,014
PSL Assertion | 2016-08-16.psl | 2,011
ASP | c_system_plugin.asp | 1,878
AWK | dic-generator.awk | 1,732
Jinja | php.ini.j2 | 1,668
Zsh | .zshrc | 1,588
Q# | in_navegador.qs | 1,568
sed | Makefile.sed | 1,554
Stylus | popup.styl | 1,550
Bitbake | Doxyfile.bb | 1,533
Rakefile | samples.rake | 1,509
Gherkin Specification | WorkflowExecution.feature | 1,421
Crystal | string.cr | 1,412
Android Interface Definition Language | ITelephony.aidl | 1,410
Xtend | Properties.xtend | 1,363
SKILL | DT_destub.il | 1,181
Madlang | .config.mad | 1,137
Spice Netlist | APEXLINEAR.ckt | 1,114
QML | MainFULL.qml | 1,078
GLSL | subPlanetNoise.frag | 1,051
Ur/Web | initial.ur | 1,018
Alloy | TransactionFeatureFinal.als | 1,012
Vala | puzzle-piece.vala | 968
Smarty Template | Ensau.tpl | 965
Mako | jobs.mako | 950
TOML | traefik.toml | 938
gitignore | .gitignore | 880
Elixir | macros.ex | 832
GN | rules.gni | 827
Korn Shell | lx_distro_install.ksh | 807
LD Script | vmlinux.lds | 727
Scons | SConstruct | 716
Handlebars | Consent-Form.handlebars | 714
Device Tree | ddr4-common.dtsi | 695
FIDL | amb.in.fidl | 686
Julius | glMatrix.julius | 686
C Shell | setup_grid.csh | 645
Lean | perm.lean | 642
Idris | Overview.idr | 637
PureScript | Array.purs | 631
Freemarker Template | result_softwares.ftl | 573
ClojureScript | lt-cljs-tutorial.cljs | 518
Fragment Shader File | bulb.fsh | 464
Elm | Attributes.elm | 434
Jade | index.jade | 432
Vue | form.vue | 418
Gradle | build.gradle | 416
Lucius | bootstrap.lucius | 404
Go Template | fast-path.go.tmpl | 400
Meson | meson.build | 306
F* | Crypto.Symmetric.Poly1305.Bignum.Lemmas.Part1.fst | 289
Ceylon | IdeaCeylonParser.ceylon | 286
MQL5 | ZigzagPattern_oldest.mq5 | 282
XCode Config | Project-Shared.xcconfig | 265
Futhark | blackscholes.fut | 257
Pony | scenery.pony | 252
Vertex Shader File | CC3TexturableRigidBones.vsh | 205
Softbridge Basic | greek.sbl | 192
Cabal | deeplearning.cabal | 180
nuspec | Xamarin.Auth.XamarinForms.nuspec | 156
Dockerfile | Dockerfile | 152
Mustache | models_list.mustache | 141
LOLCODE | LOLTracer.lol | 139
BuildStream | astrobib.bst | 120
Janet | Janet | 101
Cassius | xweek.cassius | 94
Docker ignore | .dockerignore | 92
Hamlet | upload.hamlet | 90
QCL | mod.qcl | 88
Dhall | nix.bash.dhall | 86
ignore | .ignore | 60
Just | Justfile | 46
SRecode Template | srecode-test.srt | 35
Bitbucket Pipeline | bitbucket-pipelines.yml | 30
Ur/Web Project | reader.urp | 22
Alchemist | ctrl.crn | 16
Zig | main.zig | 12
MUMPS | mps | 11
Bosque | bosque.bsq | 8
Report Definition Language | example.rdl | 4
Emacs Dev Env | Project.ede | 3
Cargo Lock | Cargo.lock | 2
JAI | thekla_atlas.jai | 1

How many “pure” projects?

Assume you define pure to mean a project that has 1 language in it. That would not be very interesting by itself, so let's see what the spread is. As it turns out, most projects have fewer than 25 languages in them, with the majority in the fewer than 10 bracket.

The peak in the below graph is for 4 languages.

Of course pure projects might only have one programming language, but have lots of other supporting formats such as markdown, json, yml, css and .gitignore, which are picked up by scc. It’s probably reasonable to assume that any project with fewer than 5 languages is “pure” (for some level of purity), and as it turns out that is just over half the total data set. Of course your definition of purity might be different to mine, so feel free to adjust to whatever number you like.

What surprises me is an odd bump around 34-35 languages. I have no reasonable explanation as to why this might be the case and it probably warrants some investigation.

scc-data pure projects

The full list of results is included below.

language count | project count
1 | 886,559
2 | 951,009
3 | 989,025
4 | 1,070,987
5 | 1,012,686
6 | 845,898
7 | 655,510
8 | 542,625
9 | 446,278
10 | 392,212
11 | 295,810
12 | 204,291
13 | 139,021
14 | 110,204
15 | 87,143
16 | 67,602
17 | 61,936
18 | 44,874
19 | 34,740
20 | 32,041
21 | 25,416
22 | 24,986
23 | 23,634
24 | 16,614
25 | 13,823
26 | 10,998
27 | 9,973
28 | 6,807
29 | 7,929
30 | 6,223
31 | 5,602
32 | 6,614
33 | 12,155
34 | 15,375
35 | 7,329
36 | 6,227
37 | 4,158
38 | 3,744
39 | 3,844
40 | 1,570
41 | 1,041
42 | 746
43 | 1,037
44 | 1,363
45 | 934
46 | 545
47 | 503
48 | 439
49 | 393
50 | 662
51 | 436
52 | 863
53 | 393
54 | 684
55 | 372
56 | 366
57 | 842
58 | 398
59 | 206
60 | 208
61 | 177
62 | 377
63 | 450
64 | 341
65 | 86
66 | 78
67 | 191
68 | 280
69 | 61
70 | 209
71 | 330
72 | 171
73 | 190
74 | 142
75 | 102
76 | 32
77 | 57
78 | 50
79 | 26
80 | 31
81 | 63
82 | 38
83 | 26
84 | 72
85 | 205
86 | 73
87 | 67
88 | 21
89 | 15
90 | 6
91 | 12
92 | 10
93 | 8
94 | 16
95 | 24
96 | 7
97 | 30
98 | 4
99 | 1
100 | 6
101 | 7
102 | 16
103 | 1
104 | 5
105 | 1
106 | 19
108 | 2
109 | 2
110 | 1
111 | 3
112 | 1
113 | 1
114 | 3
115 | 5
116 | 5
118 | 1
120 | 5
124 | 1
125 | 1
131 | 2
132 | 1
134 | 2
136 | 1
137 | 1
138 | 1
142 | 1
143 | 2
144 | 1
158 | 1
159 | 2

Projects with TypeScript but not JavaScript

Ah, the modern world of TypeScript. But for projects that are using TypeScript, how many are using TypeScript exclusively?

pure TypeScript projects
27,026 projects

I have to admit, I am a little surprised by that number. While I understand mixing JavaScript with TypeScript is fairly common, I would have thought there would be more projects using the new hotness. This may however be mostly down to the projects I was able to pull, and I suspect a refreshed project list with newer projects would change this number drastically.

Anyone using CoffeeScript and TypeScript?

using TypeScript and CoffeeScript
7,849 projects

I have a feeling some TypeScript developers are dry heaving at the very thought of this. If it is of any comfort I suspect most of these projects are things like scc which uses examples of all languages mixed together for testing purposes.
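
Both TypeScript counts boil down to membership checks on each project's list of languages. A minimal Go sketch follows, assuming each project has already been reduced to a slice of the language names scc reported. The function names are invented for this example and this is not the code that produced the numbers above.

```go
package main

import "fmt"

// hasLanguage reports whether name appears in a project's language list.
func hasLanguage(languages []string, name string) bool {
	for _, l := range languages {
		if l == name {
			return true
		}
	}
	return false
}

// typeScriptWithoutJavaScript is true for projects that contain TypeScript
// but no JavaScript. Other formats (JSON, Markdown and so on) are allowed.
func typeScriptWithoutJavaScript(languages []string) bool {
	return hasLanguage(languages, "TypeScript") && !hasLanguage(languages, "JavaScript")
}

// typeScriptAndCoffeeScript is true for projects that contain both languages.
func typeScriptAndCoffeeScript(languages []string) bool {
	return hasLanguage(languages, "TypeScript") && hasLanguage(languages, "CoffeeScript")
}

func main() {
	project := []string{"TypeScript", "JSON", "Markdown"}
	fmt.Println(typeScriptWithoutJavaScript(project))
	fmt.Println(typeScriptAndCoffeeScript(project))
}
```

Note that "exclusively" here only means the absence of JavaScript, which matches the section heading rather than a project containing nothing but TypeScript.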

What’s the typical path length, broken up by language?

Given that you can either dump all the files you need in a single directory or spread them out across nested directories, what's the typical path length and number of directories?

This is done by counting the number of path separators (/) in each file's location and averaging the result per language. I didn’t know what to expect here, other than that I would expect Java to be close to the top, as its file paths are usually quite deep.
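
The counting described above can be sketched in a few lines of Go. The function names and sample paths are made up for illustration, and this is a simplified stand-in rather than the code used to build the table.

```go
package main

import (
	"fmt"
	"strings"
)

// pathDepth counts the "/" separators in a file's location.
func pathDepth(path string) int {
	return strings.Count(path, "/")
}

// averageDepth returns the mean depth over a set of paths, or 0 if empty.
func averageDepth(paths []string) float64 {
	if len(paths) == 0 {
		return 0
	}
	total := 0
	for _, p := range paths {
		total += pathDepth(p)
	}
	return float64(total) / float64(len(paths))
}

func main() {
	// Hypothetical paths showing why Java tends to score high.
	javaPaths := []string{
		"src/main/java/com/example/App.java",
		"src/test/java/com/example/AppTest.java",
	}
	fmt.Println(averageDepth(javaPaths))
}
```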

language | average path length
ABAP | 4.406555175781266
ASP | 6.372800350189209
ASP.NET | 7.25
ATS | 4.000007286696899
AWK | 4.951896171638623
ActionScript | 8.139775436837226
Ada | 4.00042700953189
Agda | 3.9126438455441743
Alchemist | 3.507827758789091
Alex | 5.000001311300139
Alloy | 5.000488222547574
Android Interface Definition Language | 11.0048217363656
Arvo | 5.9999994741776135
AsciiDoc | 3.5
Assembly | 4.75
AutoHotKey | 2.2087400984292067
Autoconf | 5.8725585937792175
BASH | 2.1289059027401294
Basic | 3.003903865814209
Batch | 6.527053831937014
Bazel | 3.18005371087348
Bitbake | 2.015624999069132
Bitbucket Pipeline | 2.063491820823401
Boo | 4.010679721835899
Bosque | 4.98316764831543
Brainfuck | 4.2025654308963425
BuildStream | 3.4058846323741645
C | 4.923767089530871
C Header | 4.8744963703211965
C Shell | 3.027952311891569
C# | 3.9303305113013427
C++ | 3.765686050057411
C++ Header | 5.0468749664724015
CMake | 4.474763816174707
COBOL | 2.718678008809146
CSS | 3.158353805542812
CSV | 2.0005474090593514
Cabal | 2.0234456174658693
Cargo Lock | 2.602630615232607
Cassius | 3.56445312181134
Ceylon | 4.750730359584461
Clojure | 3.992209411809762
ClojureScript | 4.905477865257108
Closure Template | 6.800760253008946
CoffeeScript | 4.503051759227674
ColdFusion | 6.124976545410084
ColdFusion CFScript | 6.188602089623717
Coq | 4.000243186950684
Creole | 3.124526690922411
Crystal | 3.1243934621916196
Cython | 5.219657994911814
D | 9.291626930357722
Dart | 3.939864161220478
Device Tree | 6.530643464186369
Dhall | 0.12061593477278201
Docker ignore | 2.9984694408020562
Dockerfile | 3.1281526535752064
Document Type Definition | 6.3923129292499254
Elixir | 3.9999989270017977
Elm | 2.968016967181992
Emacs Dev Env | 4.750648772301943
Emacs Lisp | 2.0156250001746203
Erlang | 4.756546300111156
Expect | 5.126907349098477
Extensible Stylesheet Language Transformations | 4.519531239055546
F# | 5.752862453457055
F* | 4.063724638864983
FIDL | 4.484130888886213
FORTRAN Legacy | 6.117128185927898
FORTRAN Modern | 5.742561882347131
Fish | 3.993835387425861
Flow9 | 9.462829245721366
Forth | 4.016601327653859
Fragment Shader File | 3.8598623261805187
Freemarker Template | 11.122007250069213
Futhark | 6.188476562965661
GDScript | 3.2812499999872675
GLSL | 6.6093769371505005
GN | 3.497192621218512
Game Maker Language | 4.968749999941792
Game Maker Project | 3.8828125
Gherkin Specification | 3.999099795268081
Go | 3.9588454874029275
Go Template | 4
Gradle | 2.655930499769198
Groovy | 11.499969503013528
HEX | 3.98394775342058
HTML | 4.564478578133282
Hamlet | 3.4842224120074867
Handlebars | 4.998766578761208
Happy | 5.699636149570479
Haskell | 2.000140870587468
Haxe | 5.999999999999997
IDL | 6.249999993495294
Idris | 3.515075657458509
Intel HEX | 3.983397483825683
Isabelle | 4.18351352773584
JAI | 7.750007518357038
JSON | 3.9999972562254724
JSONL | 5.751412352804029
JSX | 5.0041952044625715
Jade | 4.744544962807595
Janet | 3.0312496423721313
Java | 11.265740856469563
JavaScript | 4.242187985224513
JavaServer Pages | 7.999993488161865
Jenkins Buildfile | 2.000000000087315
Jinja | 6.937498479846909
Julia | 3.9999848530092095
Julius | 3.187606761406953
Jupyter | 2.375
Just | 4.312155187124516
Korn Shell | 7.0685427486899925
Kotlin | 6.455277973786039
LD Script | 5.015594720376608
LESS | 5.999999999999886
LEX | 5.6996263030493495
LOLCODE | 3.722656242392418
LaTeX | 4.499990686770616
Lean | 4.1324310302734375
License | 4.7715609660297105
Lisp | 6.00048828125
Lua | 3.999999057474633
Lucius | 3.0000303482974573
Luna | 4.758178874869392
MQL Header | 5.421851994469764
MQL4 | 5.171874999953652
MQL5 | 4.069171198975555
MSBuild | 4.8931884765733855
MUMPS | 4.999999672174454
Macromedia eXtensible Markup Language | 3.9139365140181326
Madlang | 3.625
Makefile | 4.717208385332443
Mako | 4.0349732004106045
Markdown | 2.25
Meson | 3.342019969206285
Modula3 | 3.980173215190007
Module-Definition | 8.875000973076205
Monkey C | 3.0672508481368164
Mustache | 6.000003708292297
Nim | 3.7500824918105313
Nix | 2.0307619677526234
OCaml | 3.269392550457269
Objective C | 3.526367187490962
Objective C++ | 5.000000834608569
Opalang | 4.500069382134143
Org | 5.953919619084296
Oz | 4.125
PHP | 7.999984720368943
PKGBUILD | 4.875488281252839
PSL Assertion | 5.004394620715175
Pascal | 5.0781240425935845
Patch | 3.999999999999819
Perl | 4.691352904239976
Plain Text | 5.247085583343509
Polly | 2.953125
Pony | 2.9688720703125
Powershell | 4.596205934882159
Processing | 3.999931812300937
Prolog | 4.4726600636568055
Properties File | 3.5139240025278604
Protocol Buffers | 6.544742336542192
Puppet | 6.662078857422106
PureScript | 4.000007774680853
Python | 5.4531080610843805
Q# | 3.7499999999999996
QCL | 2.992309644818306
QML | 7.042003512360623
R | 3.0628376582587578
Rakefile | 4.78515574071335
Razor | 8.062499530475186
ReStructuredText | 5.061766624473476
Report Definition Language | 5.996573380834889
Robot Framework | 4.0104638249612155
Ruby | 5.1094988621717725
Ruby HTML | 5.57654969021678
Rust | 3.2265624976654292
SAS | 4.826202331129183
SKILL | 6.039547920227052
SPDX | 4.000203706655157
SQL | 7.701822280883789
SRecode Template | 3.500030428171159
SVG | 5.217570301278483
Sass | 6.000000000056957
Scala | 4.398563579539738
Scheme | 6.999969714792911
Scons | 5.010994006631478
Shell | 4.988665378738929
Smarty Template | 5.000527858268356
Softbridge Basic | 4.87873840331963
Specman e | 5.765624999999318
Spice Netlist | 3.9687499998835882
Standard ML (SML) | 4.031283043158929
Stata | 6.27345275902178
Stylus | 3.5000006667406485
Swift | 3
Swig | 5.246093751920853
SystemVerilog | 2.9995259092956985
Systemd | 3.9960937500000284
TCL | 2.508188682367951
TOML | 2.063069331460588
TaskPaper | 2.003804363415667
TeX | 3.500000000931251
Thrift | 4.956119492650032
Twig Template | 8.952746974652655
TypeScript | 4.976589231140677
TypeScript Typings | 5.832031190521718
Unreal Script | 4.22499089783372
Ur/Web | 4.41992186196147
Ur/Web Project | 5.1147780619789955
V | 4.251464832544997
VHDL | 4.000000961231823
Vala | 3.99804687498741
Varnish Configuration | 4.006103516563625
Verilog | 3.6906727683381173
Verilog Args File | 8.93109059158814
Vertex Shader File | 3.8789061926163697
Vim Script | 3.9995117782528147
Visual Basic | 4.5
Visual Basic for Applications | 3.6874962672526417
Vue | 7.752930045514701
Wolfram | 3.075198844074798
Wren | 4
XAML | 4.515627968764219
XCode Config | 6.969711296260638
XML | 6
XML Schema | 5.807670593268995
Xtend | 4.315674404631856
YAML | 3.2037304108964673
Zig | 3.4181210184442534
Zsh | 2.0616455049940288
gitignore | 2.51172685490884
ignore | 10.6434326171875
m4 | 3.7519528857323934
nuspec | 4.109375
sed | 4.720429063539986

YAML or YML?

Some time back on the company Slack there was a “discussion”, with many dying on one hill or the other, over the use of .yaml or .yml.

The debate can finally(?) be ended. Although I suspect some will still prefer to die on their chosen hill.

extension | count
yaml | 3,572,609
yml | 14,076,349

Upper, lower or mixed case?

What case style is used on filenames? This includes the extension so you would expect it to be mostly mixed case.

style | count
mixed | 9,094,732
lower | 2,476
upper | 2,875

Which of course is not very interesting because generally file extensions are lowercase. What about if we ignore the file extension?

style | count
mixed | 8,104,053
lower | 347,458
upper | 614,922

Not what I would have expected. Mostly mixed is normal, but I would have thought lower would be more popular.
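
For anyone wanting to repeat the check, here is a minimal Go sketch of one way to bucket a filename into these three styles, with and without its extension. The function names and the exact rules are assumptions for this example and may not match how the numbers above were produced.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// caseStyle buckets a name as "lower", "upper" or "mixed".
// Names with no letters at all (for example "123") fall into "lower",
// because lowercasing leaves them unchanged.
func caseStyle(name string) string {
	switch {
	case name == strings.ToLower(name):
		return "lower"
	case name == strings.ToUpper(name):
		return "upper"
	default:
		return "mixed"
	}
}

// stripExt removes the final extension, so "README.md" becomes "README".
// Dotfiles such as ".gitignore" are treated as all extension and become "".
func stripExt(name string) string {
	return strings.TrimSuffix(name, filepath.Ext(name))
}

func main() {
	for _, name := range []string{"README.md", "main.go", "Makefile"} {
		fmt.Println(name, caseStyle(name), caseStyle(stripExt(name)))
	}
}
```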

Java Factories

Another one that came up in the internal company slack when looking through some old Java code. I thought why not add a check for any Java code that has Factory, FactoryFactory or FactoryFactoryFactory in the name. The idea being to see how many factories are out there.

type | count | percent
not factory | 271,375,574 | 97.9%
factory | 5,695,568 | 2.09%
factoryfactory | 25,316 | 0.009%
factoryfactoryfactory | 0 | 0%

So slightly over 2% of all the Java code that I checked appeared to be a factory or factoryfactory. Thankfully there are no factoryfactoryfactories, and perhaps that joke can finally die, although I am sure at least one non-ironic one exists somewhere in some Java 5 monolith that makes more money every day than I will see over my entire working life.
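
The check itself is simple enough to sketch. The Go below assumes a case-insensitive substring match on the filename and tests the longest pattern first, so each file lands in exactly one bucket. The function name and the matching rule are assumptions for this example, not necessarily what I ran over the data.

```go
package main

import (
	"fmt"
	"strings"
)

// factoryType buckets a Java filename by how many times "factory" is
// chained in its name. The longest pattern is tested first, because a
// FactoryFactory name also contains "factory".
func factoryType(filename string) string {
	name := strings.ToLower(filename)
	switch {
	case strings.Contains(name, "factoryfactoryfactory"):
		return "factoryfactoryfactory"
	case strings.Contains(name, "factoryfactory"):
		return "factoryfactory"
	case strings.Contains(name, "factory"):
		return "factory"
	default:
		return "not factory"
	}
}

func main() {
	// Hypothetical filenames for illustration.
	for _, f := range []string{"Main.java", "WidgetFactory.java", "WidgetFactoryFactory.java"} {
		fmt.Println(f, factoryType(f))
	}
}
```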

Ignore files

The .ignore file idea was hammered out by burntsushi and ggreer in a Hacker News thread and is possibly one of the greatest cases of “competing” open source tools working together to a good outcome, and done in record time. It has become the de facto way to add things into source control yet have tools ignore them. As it turns out scc also implements .ignore files but counts them as well. Let's see how well the idea has spread.

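.ignore count | project count
0 | 9,088,796
1 | 7,848
2 | 1,258
3 | 508
4 | 333
5 | 43
6 | 130
7 | 8
8 | 14
9 | 83
10 | 49
11 | 35
12 | 112
13 | 736
15 | 4
17 | 1
18 | 4
20 | 2
21 | 1
23 | 2
24 | 3
26 | 2
27 | 1
34 | 31
35 | 19
36 | 9
38 | 2
39 | 1
43 | 12
44 | 1
45 | 2
46 | 5
49 | 7
50 | 7
51 | 12
52 | 2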

Future ideas

I'd love to do some analysis of tabs vs spaces. Scanning for things like AWS AKIA keys and the like would be pretty neat as well. I'd also love to expand the Bitbucket and GitLab coverage and get the results broken down by each, to see if groups of developers from different camps hang out in different areas.

Shortcomings I'd love to overcome in the above if I decide to do this again:

  • Keep the URL properly in the metadata somewhere. Using a filename to store this was a bad idea, as it was lossy and means it can be hard to identify the file source and location.
  • Not bother with S3. There is little point paying the bandwidth cost when I was only using it for storage. Better to just stuff everything into the tar file from the beginning.
  • Invest some time in learning a tool to help with plotting and charting of results.
  • Use a trie or some other data type to keep a full count of filenames rather than the slightly lossy approach I used.
  • Add an option to scc to check the type of the file based on keywords, as examples such as https://bitbucket.org/abellnets/hrossparser/src/master/xml_files/CIDE.C were picked up as being a C file despite obviously being HTML when the content is inspected. To be fair, all the code counters I tried behave the same way.
  • There appears to be a bug in scc where a file with no extension whose name matches an extension is counted as that language, which is incorrect. A bug has been raised in scc to address this: https://github.com/boyter/scc/issues/114
  • I’d like to add shebang detection into scc: https://github.com/boyter/scc/issues/115
  • Some sort of check against the number of GitHub stars would be pretty neat.
  • Analysis against the number of commits would be very interesting.
  • I want to add maintainability index calculations at some point. It would be very cool to see what projects are considered the most maintainable based on their size.

So why bother?

Well, I can take some of this information and plug it into searchcode.com and scc, even if only as some useful data points. That was pretty much the stated goal, and it is potentially very useful to know how your project compares to others. Besides, it was a fun way to spend a few days solving some interesting problems. I also think it is safe to say that scc is a fairly reliable tool at this point.

In addition, I am working on a tool that helps senior-developer or manager types analyze code, looking for languages, large files, flaws etc., on the assumption that you have to watch multiple repositories. You put in some code and it will tell you how maintainable it is and what skills you need to maintain it. This is useful for determining if you should buy or maintain some code-base, and for getting an overview of what your development team is producing. It should in theory help teams scale through shared resources. Something like AWS Macie but for code is the angle I am working with. It’s something I need for my day job and I suspect others may find use in it, or at least that's the theory.

I should probably put an email sign-up here at some point to gather interest in it.

Raw / Processed Files

I have included a link to the processed files (20 MB) for those who wish to do their own analysis and corrections. If someone wants to host the raw files to allow others to download them, let me know. It is an 83 GB tar.gz file which uncompressed is just over 1 TB in size. Its contents consist of just over 9 million JSON files of various sizes.
