commit 93f7170fffae046ba492c161c88a30ed77be2a08
parent bc61ec416abc5f083234f46496105ecb9828c9f2
Author: breadcat <peter@minskio.co.uk>
Date: Sun, 5 Jul 2020 18:00:47 +0100
Update to remove extra tags
Diffstat:
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/content/posts/formatting-dumped-subtitles.md b/content/posts/formatting-dumped-subtitles.md
@@ -1,6 +1,7 @@
---
title: "Formatting dumped subtitles into a vocabulary list"
date: 2020-05-28T16:52:00
+lastmod: 2020-07-05T17:59:00
tags : [ "Formats", "Languages", "Linux", "Media", "Snippets", "Software", ]
---
@@ -11,7 +12,7 @@ tr ' ' '\n' < subs.srt \
| sed -e 's/<[^>]*>//g' \
| tr '[:upper:]' '[:lower:]' \
| tr -d '\>\/!-.:?,.\",[:digit:]' \
- | sed -e '/^[[:space:]]*$/d' -re 's/\s+$//' \
+ | sed -e '/^[[:space:]]*$/d' -re 's/\s+$//' -re 's/\{...\}//' \
| sort -u > subs-sort.srt
```
@@ -19,4 +20,6 @@ In short, this will break all spaces into new lines, remove HTML tags, make ever
One issue I've noticed is that some _special_ characters won't be converted to lowercase, Å to å for example. I don't have an automated workaround for you, aside from specifying the letters individually, for example using:
-<pre><code>tr 'ÆØÅÄÖÐÞÁÉÍÓÚÝ' 'æøåäöðþáéíóúý'</pre></code>
-\ No newline at end of file
+<pre><code>tr 'ÆØÅÄÖÐÞÁÉÍÓÚÝ' 'æøåäöðþáéíóúý'</pre></code>
+
+* **Edit 2020-07-05:** Added {\an} tag removal
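
The `{\an}` tags in the edit note look like what is left of positioning tags such as `{\an8}` once the earlier `tr -d` step has deleted the digits, which would explain why the three-character pattern `\{...\}` matches them. That pattern only covers tags with exactly three characters between the braces and, without the `g` flag, only the first tag on each line. A minimal sketch of a more general variant, assuming every `{...}` block should be dropped; `strip_tags` is a hypothetical helper name and the sample line is made up for illustration:

```shell
# Hypothetical helper, not part of the original post.
strip_tags() {
  # -E selects extended regex, so \{ and \} match literal braces;
  # [^}]* matches a tag body of any length, and g removes every tag on the line.
  sed -E 's/\{[^}]*\}//g'
}

# Made-up sample line carrying a positioning tag.
printf '%s\n' '{\an8}Hej' | strip_tags
```

If this behaves as intended, it could stand in for the `-re 's/\{...\}//'` expression in the pipeline, at the cost of also removing any other brace-delimited text in the subtitles.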