[GoLUG] Mailman archive restoration from subscriber/participant emails (was: Web host discussion, 7/5/2023)
Syeed Ali
syeedali at syeedali.com
Fri Jul 7 14:55:24 EDT 2023
On Thu, 6 Jul 2023 13:48:28 -0700
Rick Moen <rick at linuxmafia.com> wrote:
> Quoting Syeed Ali (syeedali at syeedali.com):
>
> Going forward, it'll be a great idea to have a regular cron job that
> does:
>
> 1. rsync backup to off-system of the
> /var/lib/mailman/archives/private/golug.mbox/golug.mbox file.
>
> 2. Mails out periodically to a list of trusted people the output
> of "/var/lib/mailman/bin/list_members -f golug".
Noted, but I don't think I can do that myself as I'm using a
point-and-click host. I may be able to ask them for help with this and
have the file published at some particular URL like golug.org/golug.mbox
> Other bits of metadata, such as subscriber passwords and options, and
> per-mailing list administrative settings, are IMO fine points whose
> backup can be put off for a later day.
Oh I didn't even realize this. I don't know if there's a way to export
those data with the web admin interface. I definitely need this
figured out at some point, because resurrecting this mailing list has
required a person willing to manually email a notice to recent
participants to re-subscribe (which is not a good idea, expanded on
later).
It's theoretically possible for me to script something that'll crawl
through a mailing list archive and drop participants in a date-sorted
list.
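A rough sketch of what such a script might look like, assuming the archive is available as real mbox messages with ordinary From: and Date: headers (not the "user at example.com" form that the public web archive shows). The function name is made up for illustration:

```python
from datetime import timezone
from email.utils import parseaddr, parsedate_to_datetime


def participants_by_last_post(messages):
    """Return [(address, newest_post_datetime)], most recent poster first."""
    last_seen = {}
    for msg in messages:
        # parseaddr() splits "Name <addr>" into (name, addr); keep the address.
        addr = parseaddr(str(msg.get("From", "")))[1].lower()
        if not addr:
            continue
        try:
            when = parsedate_to_datetime(str(msg.get("Date", "")))
        except (TypeError, ValueError):
            continue  # missing or unparseable Date header
        if when.tzinfo is None:
            # Assumption: treat dates without a usable zone as UTC.
            when = when.replace(tzinfo=timezone.utc)
        if addr not in last_seen or when > last_seen[addr]:
            last_seen[addr] = when
    return sorted(last_seen.items(), key=lambda item: item[1], reverse=True)
```

It takes any iterable of parsed messages, so something like participants_by_last_post(mailbox.mbox("golug.mbox")) from the standard library's mailbox module should work on a local copy of the file.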
> Restoring a Mailman archive from a copy of the cumulative mbox file
> is surprisingly easy.
I had help from my hosting provider. Their tech support was prompt and
happy to take the couple of minutes to help.
> There are occasional glitches with "arch" misparsing lines with
> flush-left "From " lines inside messages as the start of new messages,
> which you then see as garbage messages at the beginning or end of the
> generated HTML archive. There are a variety of fixes for this,
> including just using your favourite scripting tools to find those
> flush-left occurrences of "From " and edit them to ">From ".
Urk. I don't think I understand. If this were to happen, would it be
visible to someone browsing the Mailman list archives on the web?
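To make the "From " problem concrete, here is a minimal sketch of the find-and-edit fix described above. It assumes a genuine message separator follows a blank line (or starts the file) and looks like "From addr Thu Jul  6 13:48:28 2023"; anything else that starts with "From " is treated as body text and gets a ">" prefix. The regular expression and function name are illustrative guesses, not anything taken from Mailman:

```python
import re

# Assumed shape of a real mbox separator line (asctime-style date stamp).
SEPARATOR = re.compile(
    r"^From \S+ +(Mon|Tue|Wed|Thu|Fri|Sat|Sun) +"
    r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +"
    r"\d{1,2} +\d\d:\d\d:\d\d +\d{4}\s*$"
)


def escape_body_from_lines(lines):
    """Prefix '>' to flush-left 'From ' lines that are not separators."""
    out = []
    prev_blank = True  # the very first line may be a separator
    for line in lines:
        if line.startswith("From "):
            if not (prev_blank and SEPARATOR.match(line)):
                line = ">" + line
        out.append(line)
        prev_blank = line.strip() == ""
    return out
```

A sentence in a message body such as "From what I can tell, it works." would come out as ">From what I can tell, it works.", while the separator lines are left alone.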
> Anyway, the screwups of _not_ having made copies of the prior
> cumulative mbox file means you inherit a more-difficult job, that
> if I understand correctly amounts to "construct an mbox file from
> someone's set of messages in a not-exactly-mbox format".
You do.
> > There are two considerations which I will have to address:
> >
> > 1. Filtering emails with X-No-Archive headers.
> >
> > https://en.wikipedia.org/wiki/X-No-Archive
>
> A quick grep of your source files should reveal whether anyone has
> even introduced this problem at all. I suspect you'll find little or
> none of that in the past messages.
I found a little.
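For anyone else wanting to do the same filtering, a minimal sketch follows. It assumes that the mere presence of an X-No-Archive header counts as an opt-out, which is the conservative reading; the function names are made up:

```python
def wants_no_archive(msg):
    """True if the message carries an X-No-Archive header at all."""
    # Assumption: any value counts as an opt-out, not only "Yes".
    return msg.get("X-No-Archive") is not None


def filter_archivable(messages):
    """Return only the messages that did not ask to be left out."""
    return [m for m in messages if not wants_no_archive(m)]
```

Header lookup on parsed email messages is case-insensitive, so "x-no-archive: yes" is caught as well.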
> And, frankly, GoLUG should be grateful that you're resurrecting its
> archive at all. I don't think holding up the project because someone
> might have not wanted his/her postings to a public mailing list to be,
> y'know, public is reasonable. Just my opinion.
It's my opinion too, but it would be valuable to pursue a fix that
others can use; it'll be easy enough.
> > 3. Deduplication.
>
> /var/lib/mailman/bin/arch will not deduplicate, no.
>
> > I do know there are other processes to deduplicate other than this
> > email client. Deduplication would have to be sorted out if
> > multiple users provided emails from their separate subscriptions.
>
> Spot on!
I'll have to research and note the deduplication efforts by others,
even if I don't use any of them.
I think what I'll end up with for myself is content from one single
person and deduplication with Claws Mail.
I'm left with the mystery that one particular mailing list participant
has (some? all?) emails survive either automated or manual
deduplication through an export and re-import. I'll do a diff across
the emails to learn more.
/shrug but it's like 40 emails across several years.
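If several people did hand over their copies, one common approach is to deduplicate on the Message-ID header. A minimal sketch, assuming each copy of a list post keeps the same Message-ID; messages with no Message-ID are kept rather than guessed at. The function name is hypothetical:

```python
def dedupe_by_message_id(messages):
    """Keep the first copy of each Message-ID, preserving order."""
    seen = set()
    unique = []
    for msg in messages:
        key = (msg.get("Message-ID") or "").strip()
        if not key:
            unique.append(msg)  # no ID to compare on, so keep it
            continue
        if key in seen:
            continue  # a copy of something already kept
        seen.add(key)
        unique.append(msg)
    return unique
```

Copies that slip through a pass like this would be the ones to diff, since their Message-IDs must differ or be missing.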
> > 4. Personal information. People who give me their emails are also
> > giving me personal information embedded within them. Testing does
> > not show any personal information appearing in archives, so I'm
> > confident there. However, I wonder if exporting the Mailman
> > archives and then looking within them would show anything.
>
> One possible approach: Tell subscribers you will generate an archive
> from reassembled copies of the previous public mailing list archive
> on day X, and that they have two weeks to advise about what allegedly
> _specified_ sensitive contents in _specified_ message URLs in the HTML
> archive they just cannot stand being public, and that you will then
> redact that data.
>
> [snip]
>
> Then, I just did "e" (edit) of each message the guy had posted,
> and changed his telephone number to "[listadmin: tel. # redacted at
> poster's request]".
We have our wires crossed.
When I look at the physical files referenced by my email client (that
is, the raw source), I see lots of non-conversation information. For
example, when I look at the email which the mailing list sent to me, and
which is your reply, I see information about my ISP. Were I to give
that email to mailman, I wouldn't want that ISP information "getting
out". The question is, would the mailman import discard those data or
would they remain in its own internal archives and become visible on a
future export/backup?
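Whatever the answer turns out to be, one way to sidestep the question would be to strip delivery headers from each message before anything is handed to Mailman. A minimal sketch, assuming an allow-list of headers is acceptable; the KEEP list and the function name are my own guesses, not anything Mailman defines:

```python
# Headers assumed to be worth keeping for threading and display.
KEEP = (
    "From", "To", "Cc", "Date", "Subject", "Message-ID",
    "In-Reply-To", "References", "MIME-Version",
    "Content-Type", "Content-Transfer-Encoding",
)


def strip_delivery_headers(msg):
    """Delete every header not on the allow-list (modifies msg in place)."""
    keep = {name.lower() for name in KEEP}
    for name in set(msg.keys()):
        if name.lower() not in keep:
            del msg[name]  # removes all occurrences, e.g. every Received:
    return msg
```

This drops Received: and similar lines, which is where the ISP details in the raw source tend to live, while leaving the body untouched.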
(A small note that mass-emailing all the former users is a major
hairball: stale addresses, replies from and conversations with recipients,
hosting rules, spam rules that cross international borders, etc.)
> Be aware that, when you edit the mbox and regenerate the archive,
> often bin/arch will renumber some of the archived messages, changing
> their URLs.
Ick.
I consider this a really offensive bug.
There are other projects, namely git, which have no consideration for
auditing and purging old data in sensible ways. It's an interesting
conversation topic, e.g. git wasn't made for a single person to use
offline and have a revisionist history.
> Yeah, but those hosting plans typically do _not_ permit access to the
> /var/lib/mailman/bin/* administrative tools, which are IMO essential
> for any Mailman siteadmin.
That's the boat I'm in, but my host is fantastic about lending a hand
for this project. Once it's perfected I think I'll be doing this for
one other stranded mailing list which was hosted alongside GoLUG.
Thereafter I think we'll be okay without a real admin.
> > (I do presently have a problem with exporting Mailman's complete
> > archive. I did it once but can't figure out how to do it again, and
> > this is worrying.)
>
> May I help? Ask, and I'll try to assist -- as long as you're talking
> about a site where you can be user "list" at the command line (which
> is what I'm used to).
I do not have the necessary elevated command-line access, but the functionality
is *supposed* to be available via the web by visiting a specified URL
and logging in as the list admin. I should be able to figure it out;
presumably it's just me not understanding things.
> One way to work around the "we promised you siteadmin but give you
> only Webmin access" problem is to use a VPS provider -- but there's a
> catch: Often, the IP address pools at VPS providers have terrible spam
> reputation at the RBLs, because of dirtbag past customers. So,
> caveat emptor. (**COUGH** Linode **COUGH**).
My hosting is what it is, and I doubt I'll change unless this latest
host *also* goes belly-up. (Host lifespan was actually part of a
discussion at our most recent GoLUG Jitsi meeting.)
I personally will never do any sort of self- or VPS-hosting; I've
retired from that complexity.
I'm happy to hand any of this off to others who can do things better in
whatever way, but the whole idea is just to "make rocket go now" and
have it work without cost for the reasonable future.
> But I haven't wanted to pay for colo hosting, and actually like
> owning/operating/controlling my own hardware and software, so I'm
> leery of VPS rental for my own needs.
I experimented with this, and in hindsight I wasted the most creative
years of my life.