
linux - How to bulk-rename files with invalid encoding or bulk-replace invalid encoded characters?


I have a Debian server on which I host music for an internet radio station. I have trouble with file names and paths because a lot of files have an invalid encoding, for example:


./music/Bändname - Some Title - additional Info/B�ndname - 07 - This Title Is Cörtain, The EncÃding Not.mp3

Ideally, I would like to remove everything that is not a letter (A-Z/a-z), a digit (0-9), a dash (-), or an underscore (_). The result should look something like this:


./music/Bndname-SomeTitle-additionalInfo/Bndname-07-ThisTitleIsCrtain,TheEncdingNot.mp3

How can I do this for a large batch of files and directories?


I've seen this similar question: bulk rename (or correctly display) files with special characters


But that only fixes the encoding; I would prefer the stricter approach described above.



Answer



You're going to run into some problems if you want to rename files and directories at the same time. Renaming just a file is easy enough, but you want to make sure the directories are also renamed. You can't simply mv Motörhead/Encöding Motorhead/Encoding, since Motorhead won't exist at the time of the call.


So we need a depth-first traversal of all files and folders that handles a directory's contents before the directory itself, renaming only the last path component at each step. The following works with GNU find and Bash 4.2.42 on OS X.


#!/usr/bin/env bash
# Usage: rename.sh /some/path
# -depth lists a directory's contents before the directory itself.
find "$1" -depth -print0 | while IFS= read -r -d '' file; do
    d="$( dirname "$file" )"
    f="$( basename "$file" )"
    # Strip everything except letters, digits, slash, dot, underscore, and hyphen.
    new="${f//[^a-zA-Z0-9\/\._\-]/}"
    if [ "$f" != "$new" ]      # if equal, name is already clean, so leave alone
    then
        if [ -e "$d/$new" ]
        then
            echo "Notice: \"$new\" and \"$f\" both exist in $d:"
            ls -ld "$d/$new" "$d/$f"
        else
            echo mv "$file" "$d/$new"      # remove "echo" to actually rename things
        fi
    fi
done

You may change the pattern to new="${f//[\\\/\:\*\?\"<>|]/}" if you only want to remove the characters that Windows cannot handle.


Save this script as rename.sh and make it executable with chmod +x rename.sh. Then call it as ./rename.sh /some/path.


Make sure to resolve any file name collisions (“Notice” announcements).
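One possible way to resolve a collision, sketched here rather than taken from the script above, is to pick the first unused name by adding a numeric suffix before the extension. The helper name unique_name and the _1, _2, ... suffix scheme are assumptions made for illustration; renaming by hand works just as well for a handful of files.

```shell
#!/usr/bin/env bash
# unique_name DIR NAME (hypothetical helper): print NAME if it is unused
# in DIR, otherwise NAME with _1, _2, ... inserted before the extension.
unique_name() {
    local dir=$1 name=$2 stem ext candidate n=1
    if [[ $name == ?*.* ]]; then      # non-empty stem followed by an extension
        stem=${name%.*}
        ext=.${name##*.}
    else                              # no extension (or a dotfile like .hidden)
        stem=$name
        ext=
    fi
    candidate=$name
    while [ -e "$dir/$candidate" ]; do
        candidate="${stem}_${n}${ext}"
        n=$((n + 1))
    done
    printf '%s\n' "$candidate"
}

# Demo in a throwaway directory.
tmp=$(mktemp -d)
touch "$tmp/resume" "$tmp/song.mp3"
unique_name "$tmp" resume       # resume is taken, so this prints resume_1
unique_name "$tmp" song.mp3     # prints song_1.mp3
unique_name "$tmp" schedule     # unused, so this prints schedule
rm -rf "$tmp"
```

The check and the later rename are separate steps, so this is only safe when nothing else is writing to the directory at the same time.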


If you're absolutely sure it does the right replacements, remove the echo from the script to actually rename things instead of just printing what it does.
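As an alternative to editing the script, the dry-run switch could be driven by an environment variable. The wrapper below is a minimal sketch: the function name run and the variable DRY_RUN are made up for this example. It also prints arguments with printf %q, so names containing spaces stay unambiguous in the preview, which a plain echo does not guarantee.

```shell
#!/usr/bin/env bash
# run (hypothetical wrapper): execute the command only when DRY_RUN=0;
# otherwise print it with shell quoting so it can be inspected safely.
run() {
    if [ "${DRY_RUN:-1}" = 0 ]; then
        "$@"
    else
        printf '%q ' "$@"
        printf '\n'
    fi
}

# Default is a dry run: this only prints the quoted command.
run mv "test/with space" "test/withspace"
```

In the script, echo mv "$file" "$d/$new" would then become run mv "$file" "$d/$new", and DRY_RUN=0 ./rename.sh /some/path would perform the renames.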


To be safe, I'd recommend testing this on a small subset of files first.




Options explained


To explain what goes on here:



  • -depth makes find process each directory's contents before the directory itself, so we can "roll up" everything from the end. By default, find visits a directory before its contents (still depth-first, not breadth-first).

  • -print0 ensures the find output is null-delimited, so we can read it with read -d '' into the file variable. Doing so helps us deal with all kinds of weird file names, including ones with spaces, and even newlines.

  • We'll get the directory of the file with dirname. Don't forget to always quote your variables properly, otherwise any path with spaces or globbing characters would break this script.

  • We'll get the actual filename (or directory name) with basename.

  • Then we remove every invalid character from $f using Bash's pattern substitution (${f//pattern/}). Invalid means anything that's not a lower- or uppercase letter, a digit, a slash (\/), a dot (\.), an underscore, or a hyphen.

  • If $f is already clean (the cleaned name is identical to the current name), skip it. This also avoids calling mv foo foo, which causes a problem on some systems.

  • If $new already exists in directory $d (e.g., you have files named resume and résumé in the same directory), issue a notice and don't rename, since mv could otherwise overwrite the existing file.

  • Otherwise, we finally rename the original file (or directory) to its new name.
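The stripping step can be tried on its own before touching any files. The sketch below wraps it in a helper called clean_name, a name invented for this example. It leaves the slash out of the pattern because it is applied to a single path component. It also sets LC_ALL=C inside the function, which is an assumption not found in the script above: it is meant to force byte-wise matching, so that a-z covers only ASCII letters and stray bytes from a broken encoding are dropped one by one.

```shell
#!/usr/bin/env bash
# clean_name NAME (hypothetical helper): print NAME with every byte removed
# that is not an ASCII letter, digit, dot, underscore, or hyphen.
clean_name() {
    local LC_ALL=C      # assumption: byte-wise, ASCII-only ranges
    local name=$1
    printf '%s\n' "${name//[^a-zA-Z0-9._-]/}"
}

clean_name 'with space'             # prints withspace
clean_name 'with-hyphen.txt'        # already clean, printed unchanged
clean_name $'B\xe4ndname - 07.mp3'  # lone Latin-1 byte 0xE4 is dropped
```

Comparing the result with the input, as the script does with [ "$f" != "$new" ], tells you whether a rename is needed at all.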


Since this will only act on the deepest hierarchy, renaming Motörhead/Encöding to Motorhead/Encoding is done in two steps:



  1. mv Motörhead/Encöding Motörhead/Encoding

  2. mv Motörhead Motorhead


This ensures all replacements are done in the correct order.
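To see the ordering for yourself, the following sketch builds a small throwaway tree and lists it with and without -depth. The helper name demo_tree and the directory names outer and inner are made up for the demonstration.

```shell
#!/usr/bin/env bash
# demo_tree [find options] (hypothetical helper): create outer/inner/file in
# a temporary directory, list it with find plus the given options, clean up.
demo_tree() {
    local tmp
    tmp=$(mktemp -d) || return 1
    mkdir -p "$tmp/outer/inner" && touch "$tmp/outer/inner/file"
    ( cd "$tmp" && find outer "$@" )
    rm -rf "$tmp"
}

echo "default order:"
demo_tree            # outer, then outer/inner, then outer/inner/file
echo "with -depth:"
demo_tree -depth     # outer/inner/file, then outer/inner, then outer
```

With -depth, every path is listed while its parent still has the old name, which is why the rename of the parent can safely come last.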




Example files and test run


Let's assume some files in a base folder called test:


test
test/Motörhead
test/Motörhead/anöther_file.mp3
test/Motörhead/Encöding
test/Randöm
test/Täst
test/Täst/Töst
test/with space
test/with-hyphen.txt
test/work
test/work/resume
test/work/résumé
test/work/schedule

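Here is the output from a run in debug mode (with the echo in front of the mv), i.e., the commands that would be called, and the collision warnings. Note that ö becomes o here instead of disappearing. That is what you would expect when the file system stores ö in decomposed form (o plus a combining diaeresis), as HFS+ on OS X does; where names are stored precomposed, as is typical on Linux, the whole character would typically be removed (Motrhead).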


mv test/Motörhead/anöther_file.mp3 test/Motörhead/another_file.mp3
mv test/Motörhead/Encöding test/Motörhead/Encoding
mv test/Motörhead test/Motorhead
mv test/Randöm test/Random
mv test/Täst/Töst test/Täst/Tost
mv test/Täst test/Tast
mv test/with space test/withspace
Notice: "resume" and "résumé" both exist in test/work:
-rw-r--r-- … … test/work/resume
-rw-r--r-- … … test/work/résumé

Notice the absence of messages for with-hyphen.txt, schedule, and test itself.

