blog.minskio.co.uk

Content and theme behind minskio.co.uk
commit 93f7170fffae046ba492c161c88a30ed77be2a08
parent bc61ec416abc5f083234f46496105ecb9828c9f2
Author: breadcat <peter@minskio.co.uk>
Date:   Sun,  5 Jul 2020 18:00:47 +0100

Update to remove extra tags

Diffstat:
M content/posts/formatting-dumped-subtitles.md | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/content/posts/formatting-dumped-subtitles.md b/content/posts/formatting-dumped-subtitles.md
@@ -1,6 +1,7 @@
 ---
 title: "Formatting dumped subtitles into a vocabulary list"
 date: 2020-05-28T16:52:00
+lastmod: 2020-07-05T17:59:00
 tags : [ "Formats", "Languages", "Linux", "Media", "Snippets", "Software", ]
 ---

@@ -11,7 +12,7 @@ tr ' ' '\n' < subs.srt \
  sed -e 's/<[^>]*>//g' \
  tr '[:upper:]' '[:lower:]' \
  tr -d '\>\/!-.:?,.\",[:digit:]' \
- sed -e '/^[[:space:]]*$/d' -re 's/\s+$//' \
+ sed -e '/^[[:space:]]*$/d' -re 's/\s+$//' -re 's/\{...\}//' \
  sort -u > subs-sort.srt
 ```

@@ -19,4 +20,6 @@ In short, this will break all spaces into new lines, remove HTML tags, make ever

 One issue I've noticed is some _special_ characters won't be converted to lowercase &Aring; to &aring; for example. I don't have an automated workaround for you aside from specifying the letters individually for example using:

-<pre><code>tr '&AElig;&Oslash;&Aring;&Auml;&Ouml;&ETH;&THORN;&Aacute;&Eacute;&Iacute;&Oacute;&Uacute;&Yacute;' '&aelig;&oslash;&aring;&auml;&ouml;&eth;&thorn;&aacute;&eacute;&iacute;&oacute;&uacute;&yacute;'</pre></code>
-\ No newline at end of file
+<pre><code>tr '&AElig;&Oslash;&Aring;&Auml;&Ouml;&ETH;&THORN;&Aacute;&Eacute;&Iacute;&Oacute;&Uacute;&Yacute;' '&aelig;&oslash;&aring;&auml;&ouml;&eth;&thorn;&aacute;&eacute;&iacute;&oacute;&uacute;&yacute;'</pre></code>
+
+* **Edit 2020-07-05:** Added {\an} tag removal
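
The expression added in this commit, `s/\{...\}//`, can be tried on its own. The following is a minimal sketch rather than the post's full pipeline: the helper name `strip_an_tags` and the sample input are made up for illustration, and `sed -E` is used as an equivalent of the commit's `-r`. It assumes the tag has already lost its digit, as it would after the earlier `tr -d` stage.

```shell
# Hypothetical helper: remove the first {xxx} group (braces around exactly
# three characters) on each line, using the expression from this commit.
strip_an_tags() {
    sed -E -e 's/\{...\}//'
}

# Sample input (illustrative only): a word still carrying a digit-less tag.
printf '%s\n' '{\an}hello' 'world' | strip_an_tags
```

Because the pattern requires exactly three characters between the braces, a tag such as `{\an8}` that still contains its digit is left untouched, so the order of the stages appears to matter.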