Problem :
I have the source of a page and I need to get all <script> tags from this file. Order is important. I need both external and inline scripts. The <script> tag itself must be included in the output. I’m looking for a Linux console tool.
I tried searching but couldn’t find anything, so I ended up using jQuery to obtain this information and pasting it into a file. But that output has some strange encoding, so I need to parse the file the traditional way.
Example:
Input:
<html>
<head>
<script src="script1.js"></script>
<script src="script2.js"></script>
<script>alert('hello');</script>
</head>
<body>
<div id="main">...</div>
<script src="footer.js">
</body>
</html>
Output:
<script src="script1.js"></script>
<script src="script2.js"></script>
<script>alert('hello');</script>
<script src="footer.js">
Second example, output only the src attributes:
script1.js
script2.js
inline script
footer.js
Solution :
You can use grep for that with its only-matching option (-o), e.g.:
$ grep -o "<[^>]*>" <(curl -s http://example.com/)
This prints every HTML tag, in the order it appears in the document.
To include only <script> tags, try (replace index.html with your file):
$ grep -Eo "<script.*(</script>|>)" index.html
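One caveat with the pattern above: .* is greedy, so if several script tags share a line, the match runs from the first <script to the last > on that line. A minimal sketch of a tighter variant follows, assuming each tag and any inline body fit on one line and the body contains no < character. The function name list_script_tags is made up for illustration.

```shell
# Hypothetical helper: print each <script> element of an HTML file, in order.
# Assumptions: a tag and its inline body sit on a single line,
# and the inline body contains no '<' character.
list_script_tags() {
  # [^>]* keeps the opening tag from running past its own '>';
  # the optional group picks up an inline body plus the closing tag, if present.
  grep -Eio '<script([[:space:]][^>]*)?>([^<]*</script>)?' "$1"
}
```

Usage would be list_script_tags index.html. A tag with no closing </script> on the same line is printed as the opening tag only.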
To get just the file names (from the src attribute), you can extend it by adding another grep, e.g.:
$ grep -Eo "<script.*(</script>|>)" index.html | grep -o '"[^"]*"' | tr -d '"'
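Note that the pipeline above prints every double-quoted attribute value, not only src, and prints nothing at all for an inline script. A rough sketch that reproduces the output of the second example is shown here, assuming src is written in lowercase with double quotes. The name list_script_sources is hypothetical, and the "inline script" label is simply the placeholder used in the question.

```shell
# Hypothetical helper: print the src of each <script> tag, in order,
# or the placeholder "inline script" when the tag has no src attribute.
# Assumption: src is lowercase and its value is double-quoted.
list_script_sources() {
  # Step 1: keep only the opening <script ...> tags, one per output line.
  grep -Eio '<script([[:space:]][^>]*)?>' "$1" |
    # Step 2: reduce each tag to its src value; 't' skips the fallback
    # substitution whenever the first one succeeded.
    sed -E -e 's/.*[[:space:]]src="([^"]*)".*/\1/' -e t -e 's/.*/inline script/'
}
```

Usage would be list_script_sources index.html.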
The commands above won’t cope with many variations of HTML markup. Parsing HTML with regular expressions is generally not advised, so for anything more complex you should use an appropriate tool (a parser in the language of your preference, or one of the shell tools designed for parsing HTML).