• Thu. Sep 22nd, 2022

Behind the scenes blog of pretix, your friendly ticketing system from next door.

Nov 29, 2020

Its been a while since we shared the story of an incident with you, and thats probably a good thing
most operational incidents we had in the past year were boring enough in nature to fix them easily.
This time, weve got a story of a data loss, caused by pure and simple human error and the story of
how we recovered the data.
Even though it is quite embarrassing how the data loss happened, we think its worth sharing the story
of its recovery, as it might allow you to learn a few useful things in case you ever end up in a
similar situation.
As you might have seen, over the last 7 months, weve extended our offerings beyond ticketing to allow
our customers to transform their events into the digital space as long as the global pandemic makes
traditional event formats impossible. The result of our effort is a joint venture called
Venueless that you should absolutely check out if you havent yet.
One component of the virtual events we run on venueless is live video streaming. In this process,
our customers use a tool like OBS or StreamYard
to create a live video stream. The stream is then sent to an encoding server of ours via RTMP.
On the encoding server, we re-encode the stream into different quality levels and then distribute
it to our very own tiny streaming CDN.
Venueless currently does not yet include a video-on-demand component and usually, our customers record
their content at the source, e.g. with OBS or StreamYard, and process or publish them on their own.
However, just to be safe, we keep a recording of the incoming stream as well. This isnt currently
part of our promoted service offering, we rather see it as a free backup service to our clients in case they
lose their recording. Given that we already consider it to just be a backup, we currently dont make any
further backups of this data.
Data loss
Usually, we delete these recordings after a while, but in some cases, our customers ask us to get them, e.g.
because their own recording failed, or because StreamYard only records the first 8 hours of every
stream. Since this doesnt happen a lot, its not yet an automated process in our system. Whenever a customer
requests a recording, we SSH into the respective encoding server and move the recording file to a
directory thats accessible through HTTP, like this:
/var/recordings $ mv recording-12345.flv public/
Thats it, we share the link with the customer, and the process is done. One of the simplest steps possible
in all this. Yesterday, a customer asked us for the recordings of the two last streams of their event. Just
before finishing up for the week, I wanted to supply them with the required file, SSHd into the server,
looked for the correct files and typed
/var/recordings $ mv recording-16678.flv recording-16679.flv
Oops. I hit return before typing out public/, and therefore replaced the last stream with the
second-last, losing one of the videos.
Damage control
Having a very naive understanding of how file systems work, I knew that the mv command has only
changed the directory listing of the file system, but hasnt actually wiped the file from the disk,
so I knew there is likely still a chance to recover the file, if its not overwritten by something
else in the meantime.
Since I didnt manage to re-mount the root partition as read-only to avoid further damage softly,
I used the big hammer to remount
everything read-only immediately:
# echo u > /proc/sysrq-trigger
Uhm, okay, this worked, but how do I install any data recovery tools now? After some experiments,
I decided it would be easiest to reboot into the recovery system provided by our server provider
Hetzner. So I configured the boot loader to boot their recovery system
from the network and forcefully rebooted the server.
To be able to perform disk dumps and have some operational flexibility without downloading a 2 TB
disk image to my local machine (which would take rougly a week), I also quickly purchased
a Hetzner Storage Box with 5 TB space.
Failed attempts
Just before I executed my fatal mv command, I executed ls -lisah to get a directory listing
of the files:
3146449 1.1G -rw-r–r– 1 www-data www-data 1.1G Nov XX XX:XX recording-16678.flv
3146113 1.6G -rw-r–r– 1 www-data www-data 1.6G Nov XX XX:XX recording-16679.flv
This meant I knew the inode number of the deleted file! As I mentioned before, my understanding
of file systems was (and is) rather naive, and I was pretty optimistic to be able to recover the
file using that information. Isnt that sort of what a journaling file system is for?
Recovering the file this way hover appeared to be impossible. ext4magic
and extundelete are powerful tools that did find some
deleted files on my disk but not the one I was looking for, even after trying different options
for over two hours.
I did not spend the time to really understand how ext4 works, but from I gathered from various
blogs, I was pretty much out of luck since the inode did no longer contain the relevant information
and ext4magic also wasnt able to recover the neccessary information from the journal
either.
debugfs: inode_dump <3146113>
0000 a081 0000 8503 0000 e83a c15f e83a c15f ………:._.:._
0020 e83a c15f 0000 0000 7200 0100 0800 0000 .:._….r…….
0040 0000 0800 0100 0000 0af3 0100 0400 0000 …………….
0060 0000 0000 0000 0000 0100 0000 e6eb c000 …………….
0100 0000 0000 0000 0000 0000 0000 0000 0000 …………….
*
0140 0000 0000 92d0 2cf5 0000 0000 0000 0000 ……,………
0160 0000 0000 0000 0000 0000 0000 6fb2 0000 …………o…
0200 2000 e3fb 208a 515b 7c65 5d5a 7c65 5d5a … .Q[e]Ze]Z
0220 e83a c15f 7c65 5d5a 0000 0000 0000 0000 .:._e]Z……..
0240 0000 0000 0000 0000 0000 0000 0000 0000 …………….
*
However, if youre in a similar situation the ext4magic how-tos are really helpful and worth a try.
Successful recovery
There is this one other approach to file recovery that is often recommended on the internet, usually
for small text files: Just grep your whole disk for known parts of its contents! So why wouldnt
this work on larger non-text files as well?
The first problem is obviously what to grep for. The only thing I know about the missing file, apart
from its rough size, is that its a FLV video file. Luckily, all FLV files
that contain video start with the byte sequence FLV\x01\x05. So lets search our 2 TB disk for
that byte sequence and print out the byte offset of all occurences!
cat /dev/md2 \
pv -s 1888127576000 \
grep -P –byte-offset –text ‘FLV\x01\x05’ \
tee -a /mnt/storagebox/grep-log.txt
This took roughly 7 hours. The pv command with the (rough) total size of the disk is optional, but gives you
a nice progress bar. Overall, this took a little over 6 hours on our server.
grep works line-based, which in a binary file menas any byte sequence between two ASCII line breaks. The
log file therefore contained lots of lines like this:
184473878409:<some binary data>FLV<some binary data>
In total, the search found 126 FLV file headers on our disk. This was pretty reassuring, since we had 122 FLV files
still known to the file system so there are at least four FLV byte sequences without a filename!
# find /mnt/disk/var/recordings/ -name ‘*.flv’ -not -empty -ls wc -l
122
Now, I needed to find out which of the 126 byte sequences did not have a filename. Since I really didnt want
to spend all weekend with a deep-dive into the ext4 disk layout, I went for an easier solution: For every file
still known in the file system, I computed a hash of the first 500 kilobytes of the file:
#!/usr/bin/python3
importglobimporthashlibimportoshashsize=500*1024known_hashes={}not_deleted_files=sorted(glob.glob(‘/mnt/disk/var/recordings/*.flv’)+glob.glob(‘/mnt/disk/var/recordings/public/*.flv’))# Ignore files shorter than our hash size
not_deleted_files=[fforfinnot_deleted_filesifos.stat(f).st_size>hashsize]forfnameinnot_deleted_files:withopen(fname,’rb’)asf:h=hashlib.md5(f.read(hashsize)).hexdigest()ifhinknown_hashes:print(“duplicate hash found:”)known_hashes[h]=fnameprint(h,fname)print(len(not_deleted_files),”files with”,len(known_hashes),”hashes”)
Interestingly, two files from the completely different customers shared the same hash of the first 500 kilobytes.
I havent tested it yet, but my theory is that those were streams that just did not contain any audio or video
in their first minutes, but only empty frames. However, since I knew this isnt the case for my missing file,
I felt confident in proceeding with this approach.
Next, I computed the same hash for every byte offest found by grep and compared it to the hashes found in the
previous step:
grep_log=’/mnt/storagebox/grep-log.txt’disk=’/dev/md2’print(“Parsing grep log”)positions=[]withopen(grep_log,’rb’)asf:forlineinf.read().split(b’\n’):ifnotline:# ignore empty line e.g. at end of file
continuepos,data=line.split(b’:’,1)pos=int(pos.decode())# add offset of FLV within line
binoffset=data.index(b”FLV\x01″)pos+=binoffsetpositions.append(pos)print(“Computing hashes of files on disk”)found_hashes={}withopen(disk,’rb’)asf:forpinpositions:f.seek(p)d=f.read(hashsize)h=hashlib.md5(d).hexdigest()ifhinknown_hashes:print(“At offset”,p,”found known hash”,h,”corresponding to”,known_hashes[h])else:print(“At offset”,p,”found unknown hash”,h)found_hashes[h]=punknown_hashes={h:pforh,pinfound_hashes.items()ifhnotinknown_hashes}files_not_found=[fnameforh,fnameinknown_hashes.items()ifhnotinfound_hashes]print(len(found_hashes),”found hashes,”,len(unknown_hashes),”of them unknown”)print(len(files_not_found),”files not found:”,files_not_found)
This produced 5 byte offsets with a checksum exactly what I expected. Four that really did not
correspond to a file, one corresponding to a file smaller than 500 kilobytes and therefore
with a different hash.
Now, all that was left to do was writing out the byte sequences of (at least) 1.6 GB starting
at the five possible byte offsets. Just to be safe, I exported 1.8 GB of each:
frommathimportfloorfromtqdmimporttqdmfromtqdm.autoimporttrangesize_missing_file=1.8*1024*1024*1024print(“Exporting files with unknown hashes:”)forh,pintqdm(unknown_hashes.items()):withopen(disk,’rb’)asfr:fr.seek(p)withopen(f’out_{h}.flv’,’wb’)asfw:foriintrange(floor(size_missing_file//1024)):fw.write(fr.read(1024))
(The usage of tqdm is optional, but gives you a nice progress bar.)
I then downloaded the five files, and indeed, the one with the highest position on disk contained
the video file I accidentally deleted. Except some very minor corruption of less than a second
somewhere in the video, the video was fully recovered. Phew.
In the long term, well of course work on preventing this from possibly happening again. Leaving
very specific solutions like alias mv=’mv -i’ aside, well obviously re-evaluate whether we need
to create separate backups of this data if our customers begin to rely on it more than we intended to,
and looking at possible video-on-demand features coming to Venueless, well at some point create
a fully automated video processing pipeline that removes the manual and error-prone steps from this
process.