[GoLUG] Mailman archive restoration from subscriber/participant emails (was: Web host discussion, 7/5/2023)
Rick Moen
rick at linuxmafia.com
Thu Jul 6 16:48:28 EDT 2023
Quoting Syeed Ali (syeedali at syeedali.com):
> I've learned how to convert a participants email archives into what
> mailman can use for its archives. It's tested and works, and I have a
> chunk of data stretching back to 2016 with more to come.
That's phenomenal, Sy. You've gone above and beyond.
Going forward, it'll be a great idea to have a regular cron job that
does:
1. An rsync backup, to an off-system host, of the
/var/lib/mailman/archives/private/golug.mbox/golug.mbox file.
2. A periodic mailing, to a list of trusted people, of the output
of "/var/lib/mailman/bin/list_members -f golug".
Any trusted individual, armed with only those two bits of information,
can then re-host the mailing list anywhere needed, _without_ losing
anything important. In particular, that captures the full cumulative
archive and the full membership roster. Other bits of metadata, such as
subscriber passwords and options, and per-mailing list administrative
settings, are IMO fine points whose backup can be put off for a later
day. Hit the low-hanging fruit first.
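A rough sketch of those two steps as shell functions, in case it helps.
The paths and the list name are the ones above; DEST and TRUSTED are
placeholders I made up, and I'm assuming a working "mail" command and
rsync-over-ssh access to the backup host. Treat it as a starting point,
not a finished tool.

```shell
#!/bin/sh
# Sketch only. DEST and TRUSTED are hypothetical placeholders, not real
# GoLUG values; adjust everything to the actual host.
LIST=${LIST:-golug}
MAILMAN=${MAILMAN:-/var/lib/mailman}
LIST_MEMBERS=${LIST_MEMBERS:-$MAILMAN/bin/list_members}
DEST=${DEST:-backup@offsite.example.org:golug-backup/}
TRUSTED=${TRUSTED:-"alice@example.org bob@example.org"}

# 1. Copy the cumulative mbox to another machine.
backup_mbox() {
    rsync -a "$MAILMAN/archives/private/$LIST.mbox/$LIST.mbox" "$DEST"
}

# 2. Mail the membership roster (-f adds full names) to the trusted people.
mail_roster() {
    # $TRUSTED is deliberately unquoted so it splits into one
    # argument per address.
    "$LIST_MEMBERS" -f "$LIST" |
        mail -s "$LIST roster $(date +%Y-%m-%d)" $TRUSTED
}
```

Put the definitions plus two lines calling backup_mbox and mail_roster
into a script, and run that from the crontab of a user who can read the
private archive directory (user "list" or root), nightly or weekly as
you prefer.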
Restoring a Mailman archive from a copy of the cumulative mbox file
is surprisingly easy. Pretty much like this:
# su - list
$ cd /var/lib/mailman
$ bin/arch -q --wipe golug archives/private/golug.mbox/golug.mbox
$
/var/lib/mailman/bin/arch (not to be confused with /usr/bin/arch)
is the pipermail archiver program. It generates the HTML archive and
the .txt archive from the specified cumulative mbox file. "-q" is
quiet mode, i.e., yes you could tell me lots of progress information,
but I'd prefer you just say nothing until you complete. The program
will run for quite a few minutes, so perhaps it's more reassuring to
_not_ do "quiet mode", the first time.
"--wipe" means "First wipe out the original archive before
regenerating."
There are occasional glitches with "arch" misparsing flush-left
"From " lines inside messages as the start of new messages,
which you then see as garbage messages at the beginning or end of the
generated HTML archive. There are a variety of fixes for this,
including just using your favourite scripting tools to find those
flush-left occurrences of "From " and edit them to ">From ".
(You then re-run the "arch" command, to try again.)
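Here is one way that scripting fix might look, as a sketch using awk.
The rule it applies is my own assumption about what a genuine separator
looks like: it follows a blank line (or starts the file) and ends in a
ctime-style date. Anything else starting with "From " gets escaped.
Separators in some other date format would be wrongly escaped, so check
the output before trusting it. (Mailman 2.1 also ships a bin/cleanarch
script intended for this same job, if your installation has it.)

```shell
# Print mbox "$1" with stray body "From " lines escaped to ">From ".
# A line is kept as a separator only if it follows a blank line (or is
# the first line) AND looks like:
#   From someone@example.org  Thu Jul  6 16:48:28 2023
escape_from_lines() {
    awk '
    function is_sep(line) {
        return line ~ /^From [^ ]+ +(Mon|Tue|Wed|Thu|Fri|Sat|Sun) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +[0-9]+ [0-9:]+ [0-9][0-9][0-9][0-9]$/
    }
    /^From / && !((NR == 1 || prev == "") && is_sep($0)) { $0 = ">" $0 }
    { print; prev = $0 }
    ' "$1"
}
```

Usage would be something like "escape_from_lines golug.mbox >
golug.fixed.mbox", writing to a new file so that the original mbox
stays untouched until you have looked over the result.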
Anyway, the screwup of _not_ having made copies of the prior
cumulative mbox file means you inherit a more difficult job that,
if I understand correctly, amounts to "construct an mbox file from
someone's set of messages in a not-exactly-mbox format".
> There are two considerations which I will have to address:
>
> 1. Filtering emails with X-No-Archive headers.
>
> https://en.wikipedia.org/wiki/X-No-Archive
A quick grep of your source files should reveal whether anyone has even
introduced this problem at all. I suspect you'll find little or none
of that in the past messages.
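A sketch of that check, assuming your source material is a directory
tree with one message per file (as an MH-style export would be; that
layout is my assumption, not something you've confirmed). It looks only
at each file's header block, so a quoted "X-No-Archive:" in a message
body doesn't count.

```shell
# List message files under directory "$1" whose headers (everything up
# to the first blank line) include an X-No-Archive header, any value.
find_no_archive() {
    find "$1" -type f | while IFS= read -r f; do
        if sed '/^$/q' "$f" | grep -qi '^X-No-Archive:'; then
            printf '%s\n' "$f"
        fi
    done
}
```

If "find_no_archive /path/to/export" prints nothing, the question is
moot for that batch of messages.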
And, frankly, GoLUG should be grateful that you're resurrecting its
archive at all. I don't think holding up the project because someone
might have not wanted his/her postings to a public mailing list to be,
y'know, public is reasonable. Just my opinion.
> 2. Filtering off-list conversations. Some are stored alongside
> mailing list emails and might be sent to me, and those must not be
> uploaded into Mailman archives.
Sure. But, again, grepping for the mailing list address as an addressee
in the "To: " or "Cc: " line should be sufficient. (Weeding out
messages that lack that.)
Be aware that, sometimes, a Mailman list will have been configured to
be addressable in multiple valid ways. Invented example (no claim
that this was used):
tech at golug.org AND tech at lists.golug.org
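As a sketch of that test, again assuming one message per file: the
function below unfolds the "To:" and "Cc:" headers and succeeds if any
of the addresses you pass appears in them. The two addresses in the
usage note are just my invented example from above, written with a real
"@". It is a plain substring match, so it is deliberately crude.

```shell
# Succeed if the To: or Cc: header of message file "$1" mentions any of
# the addresses given as the remaining arguments (case-insensitive
# substring match, so "tech@golug.org" also matches "biotech@golug.org").
is_list_mail() {
    file=$1
    shift
    # Collect only To:/Cc: headers, with folded continuation lines
    # joined on; stop at the blank line that ends the header block.
    hdrs=$(awk '
        /^$/ { exit }
        /^[ \t]/ { if (keep) printf " %s", $0; next }
        { keep = ($0 ~ /^([Tt][Oo]|[Cc][Cc]):/); if (keep) printf "\n%s", $0 }
        END { print "" }
    ' "$file")
    for addr in "$@"; do
        if printf '%s\n' "$hdrs" | grep -qiF -- "$addr"; then
            return 0
        fi
    done
    return 1
}
```

You would call it per file, e.g. "is_list_mail msgfile tech@golug.org
tech@lists.golug.org", and set aside any file for which it fails.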
> This is probably straightforward to figure out by checking that only
> emails which are either to or cc to the mailing list are included. I
> think the problem would be to identify what iterations of what domain
> names constitutes "the mailing list"; there seems to be more than one
> generation of this one:
Right!
>
> 3. Deduplication. Maybe. Claws Mail can trivially deduplicate.
> However when I export and then re-import them, some emails are
> duplicated. It's not a display problem, they are real emails. I don't
> know if importing into Mailman will automatically deduplicate those.
/var/lib/mailman/bin/arch will not deduplicate, no.
> I do know there are other processes to deduplicate other than this email
> client. Deduplication would have to be sorted out if multiple users
> provided emails from their separate subscriptions.
Spot on!
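One common way to do that outside the mail client is to key on the
Message-ID header. A minimal sketch, once more assuming one message per
file: it prints every file whose Message-ID was already seen in an
earlier file, so the first copy is the one kept. Messages with no
Message-ID at all are never reported, and a Message-ID folded onto a
continuation line is not handled.

```shell
# Print the names of message files (given as arguments) whose Message-ID
# duplicates that of an earlier argument. Only headers are examined.
list_duplicates() {
    for f in "$@"; do
        id=$(awk '
            tolower($0) ~ /^message-id:/ { sub(/^[^:]*:[ \t]*/, ""); print; exit }
            /^$/ { exit }
        ' "$f")
        printf '%s\t%s\n' "$id" "$f"
    done | awk -F '\t' '$1 != "" && seen[$1]++ { print $2 }'
}
```

Argument order decides which copy counts as the original, so with
several contributors' exports you would list the most trusted set
first. I'd review the printed list by eye before deleting anything.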
> 4. Personal information. People who give me their emails are also
> giving me personal information embedded within them. Testing does not
> show any personal information appearing in archives, so I'm confident
> there. However, I wonder if exporting the Mailman archives and then
> looking within them would show anything.
One possible approach: Tell subscribers you will generate an archive
from reassembled copies of the previous public mailing list archive
on day X, and that they have two weeks to tell you which _specified_,
allegedly sensitive contents, at which _specified_ message URLs in the
HTML archive, they just cannot stand being public, and that you will
then redact that data.
Decades ago, new SVLUG leadership had harmed the group by summarily
deleting several of the working-team mailing lists and replacing them
with a new combined group "volunteers" that was initially both moderated
and limited-membership and privately archived. After that President and
VP left office, and I was left to clean up their mess, I made an
announcement that the "volunteers" list archive would be converted to
public (on grounds that it should have been public all along).
One guy squawked and complained that several of his postings to that
mailing list had included his direct work telephone number at Cisco
Systems. So, easy fix (after grepping the mbox file and finding out
which messages):
# su - list
$ mutt -f /var/lib/mailman/archives/private/volunteers.mbox/volunteers.mbox
Then, I just did "e" (edit) of each message the guy had posted,
and changed his telephone number to "[listadmin: tel. # redacted at
poster's request]".
I could have just done the edit with sed, of course. I think I did
it with mutt just because I wanted to see each message, and view/confirm
what I was doing.
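For the record, a non-interactive version might look like the sketch
below. I've used awk rather than sed here so that the string to redact
is matched literally, with no regex escaping to get wrong. The phone
number in the usage note is a made-up placeholder, not the real one.

```shell
# Print mbox "$1" with every literal occurrence of string "$2" replaced
# by the redaction notice. Output goes to stdout; the input is not
# modified.
redact_in_mbox() {
    awk -v needle="$2" \
        -v repl="[listadmin: tel. # redacted at poster's request]" '
    {
        out = ""
        while (needle != "" && (i = index($0, needle)) > 0) {
            out = out substr($0, 1, i - 1) repl
            $0 = substr($0, i + length(needle))
        }
        print out $0
    }' "$1"
}
```

E.g., "redact_in_mbox volunteers.mbox 408-555-0100 > volunteers.new",
then diff the two files to confirm that only the intended lines changed
before moving the new file into place as user "list".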
Then, you do the "bin/arch -q --wipe ..." thing.
Be aware that, when you edit the mbox and regenerate the archive,
often bin/arch will renumber some of the archived messages, changing
their URLs.
> Many hosting plans allow:
>
> - Point-and-click Mailman setup.
Yeah, but those hosting plans typically do _not_ permit access to the
/var/lib/mailman/bin/* administrative tools, which are IMO essential for
any Mailman siteadmin.
(In Mailman lingo, the "siteadmin" has admin access to all lists, and
is usually a root-wielding sysadmin. A "listadmin" has the ability to
administer one or more specific mailing lists via the admin WebUI,
and perhaps the ability to do specific /var/lib/mailman/bin/* things
as user "list" via /etc/sudoers.)
I personally take a dim view of "hosting plans" that suggest that
they're giving you full siteadmin powers, but then it turns out to be
mediated by something regrettable like Webmin, and endlessly frustrating
in what you _cannot_ do -- which most often includes creating and
managing backup tools.
> (I do presently have a problem with exporting Mailman's complete
> archive. I did it once but can't figure out how to do it again, and
> this is worrying.)
May I help? Ask, and I'll try to assist -- as long as you're talking
about a site where you can be user "list" at the command line (which
is what I'm used to).
One way to work around the "we promised you siteadmin but give you only
Webmin access" problem is to use a VPS provider -- but there's a catch:
Often, the IP address pools at VPS providers have terrible spam
reputation at the RBLs, because of dirtbag past customers. So,
caveat emptor. (**COUGH** Linode **COUGH**).
Since the 1980s, my own solution has been to have static IP service
at my house. This has become more difficult in general,
as the ISP industry grows more hostile to home server operations
in various ways. But I haven't wanted to pay for colo hosting, and
actually like owning/operating/controlling my own hardware and software,
so I'm leery of VPS rental for my own needs.