From 93f7170fffae046ba492c161c88a30ed77be2a08 Mon Sep 17 00:00:00 2001
From: breadcat
Date: Sun, 5 Jul 2020 18:00:47 +0100
Subject: Update to remove extra tags

---
 content/posts/formatting-dumped-subtitles.md | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/content/posts/formatting-dumped-subtitles.md b/content/posts/formatting-dumped-subtitles.md
index 4250b8d..1d00c4a 100644
--- a/content/posts/formatting-dumped-subtitles.md
+++ b/content/posts/formatting-dumped-subtitles.md
@@ -1,6 +1,7 @@
 ---
 title: "Formatting dumped subtitles into a vocabulary list"
 date: 2020-05-28T16:52:00
+lastmod: 2020-07-05T17:59:00
 tags : [ "Formats", "Languages", "Linux", "Media", "Snippets", "Software", ]
 ---
 
@@ -11,7 +12,7 @@ tr ' ' '\n' < subs.srt \
 sed -e 's/<[^>]*>//g' \
 tr '[:upper:]' '[:lower:]' \
 tr -d '\>\/!-.:?,.\",[:digit:]' \
-sed -e '/^[[:space:]]*$/d' -re 's/\s+$//' \
+sed -e '/^[[:space:]]*$/d' -re 's/\s+$//' -re 's/\{...\}//' \
 sort -u > subs-sort.srt
 ```
 
@@ -19,4 +20,6 @@ In short, this will break all spaces into new lines, remove HTML tags, make ever
 
 One issue I've noticed is some _special_ characters won't be converted to lowercase Å to å for example. I don't have an automated workaround for you aside from specifying the letters individually for example using:
 
-tr 'ÆØÅÄÖÐÞÁÉÍÓÚÝ' 'æøåäöðþáéíóúý'
\ No newline at end of file
+tr 'ÆØÅÄÖÐÞÁÉÍÓÚÝ' 'æøåäöðþáéíóúý'
+
+* **Edit 2020-07-05:** Added {\an} tag removal
-- 
cgit v1.2.3