PTS 2023: the CPAN Meta Analyzer (5/5)
my stupid CPAN “meta analyzer” (again)
I looked at whether I’d blogged about my CPAN distribution analyzer before, and I had. I wrote something eight years ago that began:
Just about exactly five years ago, I wrote a goofy little program that walked through all of the CPAN and produced a CSV file telling me what was used to produce most dists. That is: it looked at the
generated_by field in the META files and categorized them.
So I guess it’s thirteen years old now! Wow.
As an aside: the first run, thirteen years ago, showed that about 2% of indexed CPAN distributions used Dist::Zilla. Five years later, it was 20%. Today, it’s 28%. With a few more, Dist::Zilla will top ExtUtils::MakeMaker. I don’t know what I think about that, but there it is.
Anyway, what is this thing?
Right now, it’s still not on the CPAN, although I should really get around to
making that happen. It’s on GitHub,
though, and you can clone and install it. You can also see its output over the
last 13 years (sampled very infrequently) on my possibly-moving-someday CPAN
data page. The key bit is the program
analyze-metacpan. When run, it finds your minicpan clone (as long as it’s in
~/minicpan where I would put it) and fires up a CPAN::Visitor. That walks
through the latest version of every distribution on the index. It extracts
them, looks at the files in them, gathers data, and moves on. As it goes, it
records one database row per distribution into an SQLite file. Then, you can
write queries against that SQLite file, like the one that shows what’s been
generating META files:
    $ ./bin/top-meta-generators -f ./dist-2023-04-30.sqlite --min 300
    There are 11228 dists by 1135 unique cpan ids using Dist::Zilla.

    generator           | dists | authors | %
    ExtUtils::MakeMaker | 12900 |    3431 | 33.02%
    Dist::Zilla         | 11228 |    1135 | 28.74%
                        |  4666 |    2255 | 11.94%
    Module::Build       |  4334 |    1007 | 11.09%
    Module::Install     |  3367 |     623 |  8.62%
    Minilla             |  1285 |     311 |  3.29%
    __OTHER__           |   823 |      98 |  2.11%
    Dist::Milla         |   470 |      99 |  1.20%
Here’s what the SQLite file looks like:
    CREATE TABLE dists (
      distfile PRIMARY KEY,
      dist,
      dist_version,
      cpanid,
      mtime INTEGER,
      mdatetime,
      is_tarbomb INTEGER,
      file_count INTEGER,
      has_meta_yml INTEGER,
      has_meta_json INTEGER,
      meta_spec,
      meta_dist_version,
      meta_generator,
      meta_gen_package,
      meta_gen_version,
      meta_gen_perl,
      meta_license,
      meta_yml_error,
      meta_yml_backend,
      meta_json_error,
      meta_json_backend,
      meta_struct_error,
      meta_provides_defined,
      has_makefile_pl INTEGER,
      has_build_pl INTEGER,
      has_dist_ini INTEGER
    );

    CREATE TABLE dist_prereqs (
      dist,
      phase,
      type,
      module,
      requirements,
      module_dist
    );

    CREATE INDEX dist_prereqs_by_dist on dist_prereqs (dist, phase, type);
    CREATE INDEX dist_prereqs_by_target on dist_prereqs ( module_dist, phase, type );
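To give a feel for how a report like the generator table could be built on this schema, here is a minimal Python sketch. It is not the code behind top-meta-generators: the function names and the toy rows are made up for illustration, it groups on meta_gen_package as an assumption about which column the real tool uses, and it simply drops small generators instead of folding them into an __OTHER__ row.

```python
import sqlite3


def top_generators(conn, min_dists=1):
    """Return (generator, dists, authors, percent) rows, busiest generator first."""
    # Percentages are relative to every row in dists, not just the rows shown.
    total = conn.execute("SELECT COUNT(*) FROM dists").fetchone()[0]
    rows = conn.execute(
        """
        SELECT meta_gen_package, COUNT(*), COUNT(DISTINCT cpanid)
          FROM dists
         GROUP BY meta_gen_package
        HAVING COUNT(*) >= ?
         ORDER BY COUNT(*) DESC, meta_gen_package
        """,
        (min_dists,),
    ).fetchall()
    return [(gen, n, authors, round(100.0 * n / total, 2)) for gen, n, authors in rows]


def demo_db():
    """Build a tiny in-memory stand-in with only the columns used above (toy data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dists (distfile PRIMARY KEY, cpanid, meta_gen_package)")
    conn.executemany(
        "INSERT INTO dists VALUES (?, ?, ?)",
        [
            ("A/AL/ALICE/Toy-One-1.0.tar.gz", "ALICE", "Dist::Zilla"),
            ("A/AL/ALICE/Toy-Two-1.0.tar.gz", "ALICE", "Dist::Zilla"),
            ("B/BO/BOB/Toy-Three-1.0.tar.gz", "BOB", "Dist::Zilla"),
            ("B/BO/BOB/Toy-Four-1.0.tar.gz", "BOB", "ExtUtils::MakeMaker"),
        ],
    )
    return conn


if __name__ == "__main__":
    for row in top_generators(demo_db(), min_dists=2):
        print(row)
```

Pointing `sqlite3.connect` at a real analysis file instead of `demo_db()` should work the same way, as long as the columns match the schema above.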
I use this for lots of little queries. For example, does anybody use the
provides entry in META.json to tell PAUSE exactly how to index the dist?
    sqlite> SELECT meta_provides_defined, COUNT(*) FROM dists GROUP BY 1;
    meta_provides_defined  count(*)
    ---------------------  --------
    0                      29170
    1                      9903
Yes! Wow, over a quarter of distributions have contents in their META provides field. What’s making those?
    sqlite> SELECT meta_gen_package, COUNT(*) FROM dists WHERE meta_provides_defined='1' GROUP BY 1;
    meta_gen_package               count(*)
    -----------------------------  --------
    ⦰                              253
    App::ModuleBuildTiny           75
    Dist::Banshee                  1
    Dist::Iller                    47
    Dist::Inkt::Profile::KJETILK   10
    Dist::Inkt::Profile::Simple    2
    Dist::Inkt::Profile::TOBYINK   217
    Dist::Milla                    73
    Dist::Zilla                    3454
    Dist::Zilla::Plugin::MetaJSON  1
    ExtUtils::MakeMaker            350
    Minilla                        1190
    Module::Build                  4143
    Module::Build::Bundle          1
    Module::Install                85
    docmaker                       1
MakeMaker! Module::Build! What’s actually making these? This query got me a list of dists to investigate:
SELECT dist FROM dists WHERE meta_gen_package = 'ExtUtils::MakeMaker' AND meta_provides_defined='1' ORDER BY dist
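The two queries above can also be run from a script. The following is a small Python sketch, with hypothetical function names and toy rows. One detail worth noting: meta_provides_defined has no declared type in the schema, so the sketch compares against the string '1' exactly as the queries in this post do, and the toy data stores the flag as text for the same reason.

```python
import sqlite3


def provides_by_generator(conn):
    """Map each generator package to the number of dists that define provides."""
    rows = conn.execute(
        "SELECT meta_gen_package, COUNT(*) FROM dists"
        " WHERE meta_provides_defined = '1' GROUP BY 1"
    )
    return dict(rows.fetchall())


def dists_to_investigate(conn, generator):
    """List dists from one generator that define provides, sorted by name."""
    rows = conn.execute(
        "SELECT dist FROM dists"
        " WHERE meta_gen_package = ? AND meta_provides_defined = '1'"
        " ORDER BY dist",
        (generator,),
    )
    return [dist for (dist,) in rows]


def demo_provides_db():
    """Toy in-memory table with only the columns the queries touch."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dists (distfile PRIMARY KEY, dist, meta_gen_package,"
        " meta_provides_defined)"
    )
    conn.executemany(
        "INSERT INTO dists VALUES (?, ?, ?, ?)",
        [
            ("toy-1.tar.gz", "Toy-Beta", "ExtUtils::MakeMaker", "1"),
            ("toy-2.tar.gz", "Toy-Alpha", "ExtUtils::MakeMaker", "1"),
            ("toy-3.tar.gz", "Toy-Gamma", "ExtUtils::MakeMaker", "0"),
            ("toy-4.tar.gz", "Toy-Delta", "Module::Build", "1"),
        ],
    )
    return conn


if __name__ == "__main__":
    conn = demo_provides_db()
    print(provides_by_generator(conn))
    print(dists_to_investigate(conn, "ExtUtils::MakeMaker"))
```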
Looking at the SQL schema above, you might have noticed it also tracks prereqs. This is really useful! For example, I wrote a program once to answer this question:
Which of my distributions declare a v5.8 prereq (or none), but depend on libraries that require something newer?
In those cases, I might feel more eager to bump the minimum version of my distribution. Let’s say Karen Etheridge is about to do a hunk of code review and wonders which code she could convert to postfix dereferencing while at it.
    $ ./bin/already-needs \
        ./dist-2023-04-30.sqlite \    # Where's our SQLite db?
        --minimum-target 5.20.0 \     # If we can't bump to v5.20 or better, skip it
        --cpanid ETHER                # We're only considering one author

    D-Z-App-Command-weaverconf     (  v5.6.0) -> ( v5.20.0) via App-Cmd Dist-Zilla Log-Dispatchouli
    D-Z-P-AuthorityFromModule      (  v5.8.0) -> ( v5.20.0) via App-Cmd Dist-Zilla Log-Dispatchouli
    [ 50 D-Z plugins elided for this blog post!)
    D-Z-PB-Author-ETHER            ( v5.13.2) -> ( v5.20.0) via App-Cmd Dist-Zilla Dist-Zilla-Plugin-CheckPrereqsIndexed plus 4 more
    D-Z-PB-FLORA                   (  v5.6.0) -> ( v5.20.0) via App-Cmd Dist-Zilla Dist-Zilla-Plugin-PodWeaver plus 3 more
    D-Z-PB-Git-VersionManager      ( v5.10.0) -> ( v5.20.0) via App-Cmd Dist-Zilla Log-Dispatchouli
    D-Z-R-FileWatcher              (  v5.6.0) -> ( v5.20.0) via App-Cmd Dist-Zilla Log-Dispatchouli
    D-Z-R-ModuleMetadata           ( v5.10.0) -> ( v5.20.0) via App-Cmd Dist-Zilla Log-Dispatchouli
    D-Z-R-RepoFileInjector         (  v5.6.0) -> ( v5.20.0) via App-Cmd Dist-Zilla Log-Dispatchouli
    JSON-Schema-Draft201909        ( v5.16.0) -> ( v5.20.0) via JSON-Schema-Modern Test-JSON-Schema-Acceptance
    MooseX-App-Cmd                 (  v5.8.5) -> ( v5.20.0) via App-Cmd
    Pod-Weaver-Plugin-Encoding     (  v5.6.0) -> ( v5.20.0) via Log-Dispatchouli Pod-Weaver
    Pod-Weaver-PluginBundle-FLORA  (       ~) -> ( v5.20.0) via Log-Dispatchouli Pod-Weaver
    Task-Kensho-Email              (  v5.6.0) -> ( v5.20.0) via Email-MIME-Kit
    Task-Kensho-ModuleDev          (  v5.6.0) -> ( v5.20.0) via App-Cmd Dist-Zilla Log-Dispatchouli
    Task-Kensho-Toolchain          (  v5.6.0) -> ( v5.20.0) via App-Cmd
The actual output is colorized for skimming.
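The idea behind that report can be sketched against the dist_prereqs table. This Python sketch is not how already-needs is implemented, and it leans on several assumptions: that a dist's perl requirement is stored as a prereq row with module = 'perl', that only runtime "requires" rows matter, that only direct prereqs are checked, and that requirement strings are plain versions like 5.008005 or v5.20.0 rather than ranges. The function names and toy rows are hypothetical, with the rows modeled on the MooseX-App-Cmd line in the output above.

```python
import sqlite3


def parse_perl_version(text):
    """Turn '5.006', '5.008005', or 'v5.20.0' into a comparable (rev, ver, sub) tuple."""
    text = text.strip()
    if text.startswith("v") or text.count(".") >= 2:
        parts = [int(p) for p in text.lstrip("v").split(".")]
    else:
        whole, _, frac = text.partition(".")
        frac += "0" * (-len(frac) % 3)  # decimal form: pad to groups of three digits
        parts = [int(whole)] + [int(frac[i:i + 3]) for i in range(0, len(frac), 3)]
    parts += [0] * (3 - len(parts))
    return tuple(parts[:3])


def perl_floor(conn, dist):
    """Return the declared runtime perl requirement of a dist, or None if absent."""
    row = conn.execute(
        "SELECT requirements FROM dist_prereqs WHERE dist = ? AND module = 'perl'"
        " AND phase = 'runtime' AND type = 'requires'",
        (dist,),
    ).fetchone()
    return parse_perl_version(row[0]) if row else None


def already_needs(conn, cpanid, target):
    """Map each of an author's dists below `target` to the prereq dists at or above it."""
    target_v = parse_perl_version(target)
    found = {}
    dists = conn.execute(
        "SELECT dist FROM dists WHERE cpanid = ? ORDER BY dist", (cpanid,)
    ).fetchall()
    for (dist,) in dists:
        own = perl_floor(conn, dist)
        if own is not None and own >= target_v:
            continue  # already declares the target (or newer); nothing to bump
        deps = conn.execute(
            "SELECT DISTINCT module_dist FROM dist_prereqs WHERE dist = ?"
            " AND phase = 'runtime' AND type = 'requires'"
            " AND module_dist IS NOT NULL ORDER BY module_dist",
            (dist,),
        ).fetchall()
        via = [d for (d,) in deps if (perl_floor(conn, d) or (0, 0, 0)) >= target_v]
        if via:
            found[dist] = via
    return found


def demo_prereq_db():
    """Toy in-memory tables with only the columns used above."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dists (dist, cpanid)")
    conn.execute(
        "CREATE TABLE dist_prereqs (dist, phase, type, module, requirements, module_dist)"
    )
    conn.executemany(
        "INSERT INTO dists VALUES (?, ?)",
        [("MooseX-App-Cmd", "ETHER"), ("App-Cmd", "RJBS")],
    )
    conn.executemany(
        "INSERT INTO dist_prereqs VALUES (?, 'runtime', 'requires', ?, ?, ?)",
        [
            ("MooseX-App-Cmd", "perl", "5.008005", "perl"),
            ("MooseX-App-Cmd", "App::Cmd", "0", "App-Cmd"),
            ("App-Cmd", "perl", "5.020", "perl"),
        ],
    )
    return conn


if __name__ == "__main__":
    print(already_needs(demo_prereq_db(), "ETHER", "5.20.0"))
```

A version closer to the real report would presumably also follow prereqs transitively and handle requirement ranges; both are left out here to keep the sketch short.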
There are tools for computing river scores and related data:
    $ ./bin/river-scores dist-2023-04-30.sqlite \
        --format score --format minperl --format dist \   # what columns?
        --min-score 3 \                                   # stop where in the river?
        | head
    score  minperl   dist
    5      5.006     ExtUtils-MakeMaker
    5      5.006002  Test-Simple
    5      ~         PathTools
    4      5.006     Scalar-List-Utils
    4      ~         Carp
    4      5.006     File-Temp
    4      ~         Exporter
    4      ~         Data-Dumper
    4      ~         Encode
Wait, what does this have to do with the PTS in Lyon?
Hey, great question.
The short version is: I added a few fields, like “does it use meta provides” and “is the archive a tarbomb?”. I also added a YYYY-MM-DD style date field, where previously there had only been epoch seconds. They’re both there now, because epoch seconds are much easier to use in SQL queries.
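As a small illustration of why both forms are handy, the Python sketch below filters on the integer mtime column and lets SQLite's built-in date() function render a YYYY-MM-DD string for reading. The function name and the toy rows are hypothetical, and the exact format stored in mdatetime is not shown in this post, so the sketch derives the date from mtime instead of reading that column.

```python
import sqlite3


def modified_since(conn, cutoff_epoch):
    """Return (distfile, YYYY-MM-DD) for dists with mtime at or after the cutoff."""
    rows = conn.execute(
        "SELECT distfile, date(mtime, 'unixepoch') FROM dists"
        " WHERE mtime >= ? ORDER BY mtime",
        (cutoff_epoch,),
    )
    return rows.fetchall()


def demo_mtime_db():
    """Toy in-memory table with only the columns the query touches."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dists (distfile PRIMARY KEY, mtime INTEGER)")
    conn.executemany(
        "INSERT INTO dists VALUES (?, ?)",
        [
            ("toy-old-1.0.tar.gz", 1672531200),  # 2023-01-01 00:00:00 UTC
            ("toy-new-1.0.tar.gz", 1682812800),  # 2023-04-30 00:00:00 UTC
        ],
    )
    return conn


if __name__ == "__main__":
    print(modified_since(demo_mtime_db(), 1680000000))
```

The comparison is plain integer arithmetic on mtime, and the date string only appears in the output, where a human has to read it.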
I used these to help inform some conversations in deciding PAUSE policies. How often does such-and-such a thing get used? If never, it’s easier to say we’ll drop it from the code. I use this code myself when thinking about what I might do when updating my own code.
I update this program once in a while, or just regenerate the analysis with today’s CPAN.