Diffstat (limited to 'content/posts/formatting-dumped-subtitles.md')
-rw-r--r-- | content/posts/formatting-dumped-subtitles.md | 7 |
1 files changed, 5 insertions, 2 deletions
diff --git a/content/posts/formatting-dumped-subtitles.md b/content/posts/formatting-dumped-subtitles.md
index 4250b8d..1d00c4a 100644
--- a/content/posts/formatting-dumped-subtitles.md
+++ b/content/posts/formatting-dumped-subtitles.md
@@ -1,6 +1,7 @@
 ---
 title: "Formatting dumped subtitles into a vocabulary list"
 date: 2020-05-28T16:52:00
+lastmod: 2020-07-05T17:59:00
 tags : [ "Formats", "Languages", "Linux", "Media", "Snippets", "Software", ]
 ---
 
@@ -11,7 +12,7 @@ tr ' ' '\n' < subs.srt \
 sed -e 's/<[^>]*>//g' \
 tr '[:upper:]' '[:lower:]' \
 tr -d '\>\/!-.:?,.\",[:digit:]' \
-sed -e '/^[[:space:]]*$/d' -re 's/\s+$//' \
+sed -e '/^[[:space:]]*$/d' -re 's/\s+$//' -re 's/\{...\}//' \
 sort -u > subs-sort.srt
 ```
 
@@ -19,4 +20,6 @@ In short, this will break all spaces into new lines, remove HTML tags, make ever
 
 One issue I've noticed is some _special_ characters won't be converted to lowercase Å to å for example. I don't have an automated workaround for you aside from specifying the letters individually for example using:
 
-<pre><code>tr 'ÆØÅÄÖÐÞÁÉÍÓÚÝ' 'æøåäöðþáéíóúý'</pre></code>
\ No newline at end of file
+<pre><code>tr 'ÆØÅÄÖÐÞÁÉÍÓÚÝ' 'æøåäöðþáéíóúý'</pre></code>
+
+* **Edit 2020-07-05:** Added {\an} tag removal
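
For anyone reading this change, here is a rough sketch of what the added `s/\{...\}//` expression does when run on its own. The helper name `strip_short_tags` is made up for illustration and is not part of the post. The sketch uses `sed -E`, which should behave like the `-r` in the diff on GNU sed. The pattern matches a brace pair enclosing exactly three characters, so it removes `{\an}` (presumably what is left of a tag such as `{\an8}` once the earlier `tr -d` step has deleted the digits) but leaves longer tags alone.

```shell
#!/bin/sh
# Hypothetical helper wrapping only the expression added in this commit.
# \{ and \} are literal braces in an extended regex; ... is any three characters.
strip_short_tags() {
    sed -E 's/\{...\}//'
}

# A tag with three characters between the braces is removed.
printf '%s\n' '{\an}hello' | strip_short_tags

# A tag with four characters between the braces does not match and is kept.
printf '%s\n' '{\an8}hello' | strip_short_tags
```

Because the substitution has no `g` flag, only the first match on each line is removed. That is probably fine here, since the pipeline has already split the input into one word per line.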